bmvdgeijn / wasp Goto Github PK

WASP: allele-specific pipeline for unbiased read mapping and molecular QTL discovery

License: Apache License 2.0

Python 77.66% C 19.52% Shell 0.72% Makefile 0.11% R 2.00%

wasp's Issues

unaligned reads are not properly handled NOW?

Dear Bryce and Graham,

I have recently updated WASP to the last version, but I have come across some brand new issues with unmapped reads (those that contain '*' on the Reference field at the end of the sorted bam file) in the find_intersecting_snps.py script. This didn't happen with my last WASP version.

In particular, the getrname method (also deprecated) at line 582:

cur_chrom = files.input_bam.getrname(read.tid)

is returning -1, so it is crashing at the end because it is out of the 'tid' range. In a future, will it be possible again for find_intersecting_snps.py to handle this reads without need to previously filtering them out? Or am I missing something here?

Many thanks,
Walter

Script Unable to find SNP files

Hi,
When I try to run the script, it is unable to locate the SNP files in the directory I specified. Are there any specific formating requirements for the SNP file beyond those listed in the readme? Are values in the columns separated by commas or spaces? And is a header required?

Thanks!

saw read more times than expected

Hey guys, I'm going through the remapping pipeline and getting the following error

"ValueError: saw read SEQCORE-1829592:151:C7NHAANXX:8:2108:19771:96214 more times than expected in input file"

after running filter_remapped_reads.py. The steps I took were as follows:

# identify intersecting SNPs in MAP1
python $WASP_DIR/mapping/find_intersecting_snps.py \
--is_sorted \
--output_dir ./ \
--snp_tab $PRE.snp_tab.h5 \
--snp_index $PRE.snp_index.h5 \
--haplotype $PRE.haps.h5 \
--samples $INP.IND \
$INP.MAP1.bam

# remap using bowtie to generate MAP2
zcat $INP.MAP1.remap.fq.gz | bowtie2 -p 16 -x hg19 - -S $INP.MAP2.sam
samtools view -bS $INP.MAP2.sam | samtools sort -o $INP.MAP2.bam -
samtools index $INP.MAP2.bam

# run the filter on MAP1 and MAP2 to generate MAP2.keep
python $WASP_DIR/mapping/filter_remapped_reads.py \
$INP.MAP1.to.remap.bam \
$INP.MAP2.bam \
$INP.MAP2.keep.bam

When I look for this read in the remapped output it does show up twice:

$ grep SEQCORE-1829592:151:C7NHAANXX:8:2108:19771:96214 $INP.MAP2.sam
SEQCORE-1829592:151:C7NHAANXX:8:2108:19771:96214.29680282.1.1	0	chr16	29680282	42	126M	*	0	0	AACATAGAAAGACCCTGTCTCTACAAAAACACAAAAAATTAGCCAGGCGTGGTGGTGCATGCCTGTAGTACCAGCTACTTGAGAAGCAGAGGCAGGAGGACTGCTTGAAGCCAGGAGTTTGAGACC	FFFFFFFF<<///<FFFB/FFFB<FF<FFFFFFFF/BF<FFFFFFFFFB/FFFFFFFFFBF/FFB/FF<<//BFFFBFFFFBBBFB//F<BFBFFFF<FFFFF/<FBFF/</BBFBFFFFFBBB/B	AS:i:-3	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:87T38	YT:Z:UU
SEQCORE-1829592:151:C7NHAANXX:8:2108:19771:96214.29680187.1.1	0	chr16	29680187	42	126M	*	0	0	GATCCAGCGCAGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGCAGAGCAGGGAGGATTGCTTGAGTCTAGGAGTTCAAGACCAGCCTGGGTAACATAGAAAGACCCTGTCTCTACAAAAACA	BBBBB/BFFFFFFFBFFFFFFF<FFFFFFFFFFF</FFFFFFFFFF/FBFFBBFFFFFFBFFFFFBF<F//FBFFFF</BFFB<FBFF/F<FBFBFFFBFFFFFFFFFFFFBFBFFFFFFFFBB/B	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:126	YT:Z:UU

any idea what I'm doing wrong or should try to diagnose it? Appreciate any suggestions.

SNP out of order has been skipped

Hi,

When using find_intersecting_snps.py I get thousands of error messages in the format:

SNP out of order has been skipped
75 170054104 100663261 !!!

I have sorted these files by coordinate in samtools, so why am I getting so many of these messages? Also, could you explain the format of these messages for me? For example, I get many of the messages for 170054104, but with different ending positions, e.g.:

75 170054104 33554394 !!!    
SNP out of order has been skipped
75 170054104 100663261 !!!
SNP out of order has been skipped
75 170054104 134217697 !!!
SNP out of order has been skipped
75 170054104 67108838 !!!
SNP out of order has been skipped
75 170054104 50331628 !!!
SNP out of order has been skipped
75 170054104 25165820 !!!
SNP out of order has been skipped
75 170054104 50331637 !!!
SNP out of order has been skipped
75 170054104 106255700 !!!
SNP out of order has been skipped
Finished!

find_intersecting_snps.py discarding ~50% of input reads

Hello Bryce and Graham,

Currently the consortium I'm working for is testing the WASP remapping pipeline for our RNA-Seq dataset.

However, we're running into the issue that when we run the find_intersecting_snps.py script, the resulting output (combination of the keep and to_remap bam files) only contains around half of the reads that we input into the script, going from ~46.5 million to ~25.2 million read pairs. My colleagues from another institute notice the same thing when working on different samples, however to a far lesser degree.

We're quite certain that this effect is not caused by the script discarding reads because of indels, since we do not input any indels in the SNPlist files. When making python report each SNP's checked ptype it only reports "snp" as well.

We're wondering what causes this behavior and how we can possibly solve this?

Thanks in forward,
Joost

adjusting P-values

Hello Bryce and Graham,

First of all, happy New Year!

After running CHT on permuted data, it appears the permuted p-values don't follow null. You suggested either to manually set the overdispersion parameters or to adjust the p-values according to the permuted distribution. I have the following questions:

How should I manually adjust the overdispersion parameter? I was not able to find much reference on manual parameter adjustments.
If going with second option, can I simply run combined_test.py N times with --shuffle flag to obtain N sets of permutation p-values as the null distribution, and calculate calibrated p-values using this null distribution?
I would imagine the first option is faster, while the second option provides more accurate calibration. Which one would you recommend?

Thank you!
Bosh

cross-ref CHT output to peaks

Thanks for putting together such a detailed pipeline description, it was really helpful in getting everything working.

I ran the CHT using SNPs that were inside and proximal to ChIP-seq peaks (I assume the haplotypes of heterozygous carriers of proximal SNPs are tested for imbalance). This means some SNPs get tested twice for multiple peaks. However, the CHT output only lists the test SNP position and not the target peak that it was tested for. Am I screwing something up by testing proximal SNPs? Is there a way to connect the CHT output back to the corresponding peak positions?

GENOMEWIDE.READ.COUNT does not equal the sum of REGION.READ.COUNT

Hello Bryce and Graham,

I notice that the sum of REGION.READ.COUNT is less than GENOMEWIDE.READ.COUNT. Shouldn't they be equal? If not, why? Sorry if I am missing something obvious!

From reading extract_haplotype_read_counts.py, GENOMEWIDE.READ.COUNT is obtained directly from the h5 file, so naively I would assume that some of the reads does not overlap linked regions, therefore GENOMEWIDE.READ.COUNT is greater than the sum of REGION.READ.COUNT.

More importantly, if these two quantities are not equal, how should I calculate GENOMEWIDE.READ.COUNT after permuting REGION.READ.COUNTS (for test calibration)? Thanks!

SNPs in HDF5 files are not recognized

I was able to get WASP to run successfully when my genotypes are stored in text files. However I would like to use HDF5 files for maximum accuracy. I made HDF5 files from VCFs using snp2h5

When I try to run find_intersecting_snps.py I get the following for every chromosome:
starting chromosome 1
reading SNPs from file '/udd/resaf/WASP/WASP/data/HRC_VCF_chr22/snp_tab.h5'
WARNING: chromosome 1 is not in snp_tab.h5 file, assuming no SNPs for this chromosome
processing reads
starting chromosome 10
reading SNPs from file '/udd/resaf/WASP/WASP/data/HRC_VCF_chr22/snp_tab.h5'
WARNING: chromosome 10 is not in snp_tab.h5 file, assuming no SNPs for this chromosome

The HDF5 files seem to have been properly made so I am not sure why the data are not recognized.

Support for multithreading

Are there any plans in the near future to have WASP support multithreading?

Haplotypes missing for run_sim_pe_reads.sh

The file $WASP/example_data/genotypes/chr22.hg19.haplotype.txt.gz is referenced, which does not exist in the repository, and there are no instructions on how to generate this file.

filter_remapped_reads single end pulls two reads at atime

Hi there,

I noticed that as currently written, filter_remapped_reads.py pulls a second read in each loop of the read filter, as if it is in paired end mode.

        #else:
        #    try:
        #        second_read = to_remap_bam.next()
        #        print "what?"
        #    except:
        #        end_of_file=True
        #        break

This bit of code actually seems to be redundant to the lines just above it as far as I can tell?

            if is_paired_end:
                try:
                    orig_read = to_remap_bam.next()
                except:
                    sys.stderr.write("File ended unexpectedly (no pair found).")
                    exit()
                keep_bam.write(orig_read)

If I comment those lines out, the script seems to work as intended on (single end) test data. Let me know if I'm misunderstanding something about the purpose of that code. I pulled freshly today, so I don't think it's an old problem. Cheers!

Noah

SNP file not found

Hi,

I am having problems getting the find_intersecting_snps.py to recognize my snp files. Running a test using chromosome 1 and only 1 SNP my chr1.snps.txt.gz file looks like this:
1070232 G A

I can replicate the error message using the test data, issue_18.bam and chr7.snps.txt.gz, by simply unzipping the snp file, chr7.snps.txt, and running the find_intersecting_snps.py.

However, my chr1.snps.txt.gz file is not being read even when the correct directories are used and the file is zipped. I'm not sure what is causing this.

A copy of the bam, snp and code files used are provided below:
Link to Documets

Any assistance is greatly appreciated.

Correctly re-aligned read pair not included in final mapping output

I've found a read pair that is re-aligned correctly (and reported as such by WASP) yet is not printed to the final output file. I've included files and code necessary to reproduce the bug here: https://dl.dropboxusercontent.com/u/3886457/wasp_bug.tar.gz. I'm using the most recent code from master. See run.sh for the commands and run.outerr has stdout/stderr.

Filtering based on MAPQ after WASP

Hi all,

I implemented WASP to reduce mapping biases and reference biases.
My question is: Is it still recommended to implement an additional filtering step based on MAPQ 10 or 20 of reads after WASP's pipeline or this will cause an underestimation of read counting and excessive reads loss ? In am wondering if the additional filtering I implemented is causing loss of power to detect true expression data.

Best
anto
kings College London

find_intersecting_snp get stuck

Hi Bryce and Graham,

I am running find_intersecting_snp.py to filter a mouse RNAseq dataset mapped onto the GRCm38 reference genome. However, the script somehow got stuck at chromosome 17. More specifically, it has not been printing reads to *.to.remap.bam for the past 24 hours. I ran the job twice and the script got stuck at chromosome 17 both times. I have examined the STAR alignment statistics, and chromosome 17 does not seem to have abnormally large number of reads.

I understand the question is not specific so I am not looking for a direct answer from you. Rather, what debugging steps would you suggest?

Thanks,
Bosh

Recursion failure in Denovo transcriptome

I'm attempting to account for mapping biases in a denovo transcriptome, so I'm working across ~20,000 contigs instead of a small number of chromosomes, and I'm reaching the python recursion limit when I point find_intersecting_snps.py to a SNP directory with 20,000 different files. I guess I could split these into sets of 1000 and check each, but is there any way that you can think of to get around this limit, or any possibility of an iterative rather than recursive implementation that might make things more feasible for non-model applications? Thanks so much for any advice!

AttributeError: 'module' object has no attribute 'openFile'

Hi Graham,

 I met a problem when I use the find_intersecting_snps.py. Could you please tell me how to fix it?

 command line: /home/WASP-master/mapping/find_intersecting_snps.py --is_paired_end --output_dir /home/Allele/test0228 --snp_index /home/Allele/test0228/genotypes/snp_index.h5 --snp_tab /home/Allele/test0228/genotypes/snp_tab.h5 --haplotype /home/Allele/test0228/genotypes/haps.h5 /home/Allele/test0228/2_14.bam

python version: 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2]
pysam version: 0.8.4
Traceback (most recent call last):
File "/mnt/xdlab1/tools/WASP-master/mapping/find_intersecting_snps.py", line 954, in
samples=samples)
File "/mnt/xdlab1/tools/WASP-master/mapping/find_intersecting_snps.py", line 927, in main
haplotype_filename=haplotype_filename)
File "/mnt/xdlab1/tools/WASP-master/mapping/find_intersecting_snps.py", line 67, in init
self.snp_tab_h5 = tables.openFile(snp_tab_filename, "r")
AttributeError: 'module' object has no attribute 'openFile'

And I had download tables-3.3.0 for python.

Thank you so much!

Peaks Files

Quick question: Where do the peaks files come from? See this line in the workflow example. I think I must have missed a step along the way...

Calibrating CHT using permutation

Hello Bryce and Graham,

I would like to calibrate CHT on my dataset using permutation. In the paper supplement you mentioned that you permuted total read depths (and matching genome-wide read depth counts) and randomly flipped allele-specific counts at linked heterozygous SNPs with probability 0.5. I wonder if this can be accomplished by simply flipping the genotypes at every SNP with probability 0.5 in the VCF file? If not, what is the concern and how did you flip the counts? Thank you and happy holidays!

filter_remapped_reads.py. very slow?

Dear owner,
I am implementing wasp mappability filter pipeline in order to improve the mapping. Every thing OK,
however, when running filter_remapped_reads.py it is very slow (more than 10 hours without results, output.bam empty and so I stop it because it is still running.). Possible so slow just for one sample? I do not think so. Here the message I get after stopping the algorithm,

Traceback (most recent call last):
File "filter_remapped_reads.py", line 180, in
main()
File "filter_remapped_reads.py", line 177, in main
options.orig_num_file, options.is_paired_end)
File "filter_remapped_reads.py", line 48, in run
chrm = remap_read.qname.strip().split(":")[1]
KeyboardInterrupt

What the problem is?
Are there new versions of the script that fix the problem ?

many thank for help
antonino
King's College London

I need help please!

I am on step two of mapping and I would like to know what version on python and pysam I should use because I keep getting errors

$ python /Users/olneykimberly/Desktop/TopHat/tophat-2.0.13.OSX_x86_64/hg18_out/find_intersecting_snps.py -p /Users/olneykimberly/Desktop/TopHat/tophat-2.0.13.OSX_x86_64/hg18_out/accepted_hits.bam /Users/olneykimberly/Desktop/TopHat/tophat-2.0.13.OSX_x86_64/hg18_out/SNP_files
Starting on chromosome chr1
Traceback (most recent call last):
File "/Users/olneykimberly/Desktop/TopHat/tophat-2.0.13.OSX_x86_64/hg18_out/find_intersecting_snps.py", line 443, in
main()
File "/Users/olneykimberly/Desktop/TopHat/tophat-2.0.13.OSX_x86_64/hg18_out/find_intersecting_snps.py", line 426, in main
bam_data=Bam_scanner(options.is_paired_end,options.max_window,sort_file_name,keep_file_name,remap_name,remap_num_name,fastq_names,snp_dir)
File "/Users/olneykimberly/Desktop/TopHat/tophat-2.0.13.OSX_x86_64/hg18_out/find_intersecting_snps.py", line 89, in init
self.switch_chr()
File "/Users/olneykimberly/Desktop/TopHat/tophat-2.0.13.OSX_x86_64/hg18_out/find_intersecting_snps.py", line 135, in switch_chr
self.get_next_snp()
File "/Users/olneykimberly/Desktop/TopHat/tophat-2.0.13.OSX_x86_64/hg18_out/find_intersecting_snps.py", line 182, in get_next_snp
snp_line=self.snpfile.readline()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 436, in readline
c = self.read(readsize)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 243, in read
self._read(readsize)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 278, in _read
self._read_gzip_header()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 174, in _read_gzip_header
raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file

Please Please help me! I'm going crazy with downloading and installing python and pysam and still it doesn't work :(

ImportError: No module named libchtslib

Dear owner,

I implemented WASP im my computer various time, and it performs very well, good algorithm.
However, when I run find_intersecting_snps.py (In WASP mappability filter) my account in a server
a get this error:

Traceback (most recent call last):
File "find_intersecting_snps.py", line 6, in
import pysam
File "/gpfs/home/zitoa/ChrX/apps/WASP/pysam/init.py", line 1, in
from pysam.libchtslib import *

ImportError: No module named libchtslib

Why ?? I did not find this module in the pysam folder.

Best,
anto

Problem using rmdup_pe.py

Dear Graham,

I don't seem to be able to get the rmdup_pe.py script to work on a bam file of mine and was wondering if you could maybe have a look.

These are paired end RNA-Seq reads. The bam file was generated following your wasp pipeline using star and the two python scripts available from your site (find_intersecting_snps.py and filter_remapped_reads.py). Reads passing the remapping step were merged with the ones kept from the first round. Up to that point everything seemed to work fine.

When I next ran rmdup_pe.py I got a KeyError from the del keep_cache[keep_read.qname] statement in line 117 of the update_read_cache function. This happens actually very early during the run (second time the function is called), and at that point the keep_cache dictionary is empty.

I first thought the problem was related to the large number of overlapping mates in the bam file (see below), but I then saw that you have taken into account this situation in your script.

First 20 lines of bam:

SRR2972012.826554.1	163	1	629128	255	88M13S	=	629128	88	GGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATCTGTCTCTTATAC	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFF<	NH:i:1	HI:i:1	AS:i:174	nM:i:0
SRR2972012.826554.1	163	1	629128	255	88M13S	=	629128	88	GGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATCTGTCTCTTATAC	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFF<	NH:i:1	HI:i:1	AS:i:174	nM:i:0
SRR2972012.826554.1	83	1	629128	255	13S88M	=	629128	-88	GTATAAGAGACAGGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACAT	BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFBIIIIFFIFIIFIIIIIIIFIFFFFFFFFFFBBB	NH:i:1	HI:i:1	AS:i:174	nM:i:0
SRR2972012.826554.1	83	1	629128	255	13S88M	=	629128	-88	GTATAAGAGACAGGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACAT	BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFBIIIIFFIFIIFIIIIIIIFIFFFFFFFFFFBBB	NH:i:1	HI:i:1	AS:i:174	nM:i:0
SRR2972012.777080.1	163	1	629140	255	76M25S	=	629140	76	CTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATCTGTCTCTTATACACATCTGACGCT	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFFFFBFBBBB<	NH:i:1	HI:i:1	AS:i:150	nM:i:0
SRR2972012.777080.1	163	1	629140	255	76M25S	=	629140	76	CTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATCTGTCTCTTATACACATCTGACGCT	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFFFFBFBBBB<	NH:i:1	HI:i:1	AS:i:150	nM:i:0
SRR2972012.777080.1	83	1	629140	255	25S76M	=	629140	-76	GCTCGGAGATGTGTATAAGAGACAGCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACAT	BFFFBBFFFBBFFFFFFFFFFFFFFBBFFFIIFIIIIIIIFFFFFIFIFIIIIIIIFIIIFIFFFFF<IIFFFFFFFIFFIIIFFFIIFFFFFFFFFFBBB	NH:i:1	HI:i:1	AS:i:150	nM:i:0
SRR2972012.777080.1	83	1	629140	255	25S76M	=	629140	-76	GCTCGGAGATGTGTATAAGAGACAGCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACAT	BFFFBBFFFBBFFFFFFFFFFFFFFBBFFFIIFIIIIIIIFFFFFIFIFIIIIIIIFIIIFIFFFFF<IIFFFFFFFIFFIIIFFFIIFFFFFFFFFFBBB	NH:i:1	HI:i:1	AS:i:150	nM:i:0
SRR2972012.1101183.1	99	1	629142	255	74M27S	=	629142	74	ATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATCTGTCTCTTATACACATCTCCGAGCCC	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFBFBFFFBFFFFFFFFFBFFBFFBBFB	NH:i:1	HI:i:1	AS:i:146	nM:i:0
SRR2972012.1101183.1	99	1	629142	255	74M27S	=	629142	74	ATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATCTGTCTCTTATACACATCTCCGAGCCC	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFBFBFFFBFFFFFFFFFBFFBFFBBFB	NH:i:1	HI:i:1	AS:i:146	nM:i:0
SRR2972012.1101183.1	147	1	629142	255	27S74M	=	629142	-74	GCAGCGTCAGATGTGTATAAGAGACAGATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACAT	BFFBBBBFBBFFFFFFFFFFFFFFFFFFFFIIIFFIIIIIIIIFFIIIIIIIIIIIFIIIIFIIIFFBIFFBFBIFFFFFIIIFIIIIFFFFFFFFFFBBB	NH:i:1	HI:i:1	AS:i:146	nM:i:0
SRR2972012.1101183.1	147	1	629142	255	27S74M	=	629142	-74	GCAGCGTCAGATGTGTATAAGAGACAGATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACAT	BFFBBBBFBBFFFFFFFFFFFFFFFFFFFFIIIFFIIIIIIIIFFIIIIIIIIIIIFIIIIFIIIFFBIFFBFBIFFFFFIIIFIIIIFFFFFFFFFFBBB	NH:i:1	HI:i:1	AS:i:146	nM:i:0
SRR2972012.609108.1	163	1	629792	255	101M	=	629914	223	TTCCACAGAAGCTGCCATCAAGTATTTCCTCACGCAAGCAACCGCATCCATAATCCTTCTAATAGCTATCCTCTTCAACAATATACTCTCCGGACAATGAA	BBBFFFFFFFFFFIIIIIIIIIFFIIIIIIIIIIIIIIIIIIIIIIIIIIFFFIIIIIIIIIFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFBBFFB	NH:i:1	HI:i:1	AS:i:200	nM:i:0
SRR2972012.609108.1	163	1	629792	255	101M	=	629914	223	TTCCACAGAAGCTGCCATCAAGTATTTCCTCACGCAAGCAACCGCATCCATAATCCTTCTAATAGCTATCCTCTTCAACAATATACTCTCCGGACAATGAA	BBBFFFFFFFFFFIIIIIIIIIFFIIIIIIIIIIIIIIIIIIIIIIIIIIFFFIIIIIIIIIFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFBBFFB	NH:i:1	HI:i:1	AS:i:200	nM:i:0
SRR2972012.407929.1	99	1	629862	255	101M	=	629912	151	CTCTTCAACAATATACTCTCCGGACAATGAACCATGACCAATACTACCAATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAAT	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIFFIIIIIIIIIIIIIIFFFFFFFFFFBFFFBFBBBFFB0<B<B<<<BB####	NH:i:1	HI:i:1	AS:i:196	nM:i:2
SRR2972012.407929.1	99	1	629862	255	101M	=	629912	151	CTCTTCAACAATATACTCTCCGGACAATGAACCATGACCAATACTACCAATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAAT	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIFFIIIIIIIIIIIIIIFFFFFFFFFFBFFFBFBBBFFB0<B<B<<<BB####	NH:i:1	HI:i:1	AS:i:196	nM:i:2
SRR2972012.422938.1	99	1	629864	255	101M	=	629913	150	CTTCAACAATATACTCTCCGGACAATGAACCATGACCAATACTACCAATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAG	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIIIFFFBFFFFFFFFFFFFFFFFFFFFFFBBFFF<	NH:i:1	HI:i:1	AS:i:196	nM:i:2
SRR2972012.422938.1	99	1	629864	255	101M	=	629913	150	CTTCAACAATATACTCTCCGGACAATGAACCATGACCAATACTACCAATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAG	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIIIFFFBFFFFFFFFFFFFFFFFFFFFFFBBFFF<	NH:i:1	HI:i:1	AS:i:196	nM:i:2
SRR2972012.117983.1	99	1	629867	255	101M	=	629914	148	CAACAATATACTCTCCGGACAATGAACCATGACCAATACTACCAATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAGCCC	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFBFFFFFFBFBFF	NH:i:1	HI:i:1	AS:i:196	nM:i:2
SRR2972012.117983.1	99	1	629867	255	101M	=	629914	148	CAACAATATACTCTCCGGACAATGAACCATGACCAATACTACCAATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAGCCC	BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFBFFFFFFBFBFF	NH:i:1	HI:i:1	AS:i:196	nM:i:2

I've tried to bypass the error using a try/except statement and found that the same exeption is raised many times during the full run (158,848 times).

In the end 1,577,452 out of 1,909,388 reads are kept, although the script does not report discarding any reads:

DISCARD reads:
  unmapped: 0
  mate unmapped: 0
  improper pair: 0
  different chromosome: 0
  secondary alignment: 0
  missing pairs (e.g. mismatched read names): 0
  not paired: 0
  duplicate pairs: 0
KEEP reads:
  pairs: 788726

For a comparison, I also ran picard MarkDuplicates on the same file, which kept only 724,208 reads (out of the 1,909,388), so less than half as many.

I have quite a large number of similar bam files to process, and so I would like to get the script to handle them properly. As I do not fully understand the algorithm, it would would be great if you could have a look. If you want me to upload the full bam or a larger part thereof please just let me know.

Thanks a lot in advance,

Maik

snp2h5 gives ERROR: snptab.c:126: failed to write record to SNP table

Hi there,

using WASP/snp2h5 I have just encountered the folowing error which I don't seem to be able to make sense of:


[mce@dzne-go-cn03 test]$ /home/mce/WASP/snp2h5/snp2h5 --chrom /home/mce/genomes/mouse/mm10_chromInfo.txt --format vcf --haplotype h5_files/haplotypes.h5 --snp_index h5_files/snp_index.h5 --snp_tab h5_files/snp_tab.h5 chromosome_vcf_files/C57BL_6JxCAST_EiJ_chr*.vcf.gz
long alleles will be truncated to 100bp
writing haplotypes to: h5_files/haplotypes.h5
writing SNP index to: h5_files/snp_index.h5
writing SNP table to: h5_files/snp_tab.h5
chromosome: chr10, length: 130694993bp
reading from file chromosome_vcf_files/C57BL_6JxCAST_EiJ_chr10.vcf.gz
counting lines in file
  total lines: 1269364
reading VCF header
  VCF header lines: 63
  number of samples: 1
initializing HDF5 matrix with dimension: (1269301, 2)
parsing file and writing to HDF5 files
..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................HDF5-DIAG: Error detected in HDF5 (1.8.13) thread 140597336254272:
  #000: H5Tcompound.c line 254 in H5Tget_member_type(): unable register datatype atom
    major: Datatype
    minor: Unable to register new atom
  #001: H5I.c line 895 in H5I_register(): can't insert ID node into skip list
    major: Object atom
    minor: Unable to insert object
  #002: H5SL.c line 995 in H5SL_insert(): can't create new skip list node
    major: Skip Lists
    minor: Unable to insert object
  #003: H5SL.c line 687 in H5SL_insert_common(): can't insert duplicate key
    major: Skip Lists
    minor: Unable to insert object
ERROR: snptab.c:126: failed to write record to SNP table

Here is the beginning of the vcf file:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=StrandBias,Description="Min P-value for strand bias (INFO/PV4) [0.0001]">
##FILTER=<ID=EndDistBias,Description="Min P-value for end distance bias (INFO/PV4) [0.0001]">
##FILTER=<ID=MaxDP,Description="Maximum read depth (INFO/DP or INFO/DP4) [250]">
##FILTER=<ID=BaseQualBias,Description="Min P-value for baseQ bias (INFO/PV4) [0]">
##FILTER=<ID=MinMQ,Description="Minimum RMS mapping quality for SNPs (INFO/MQ) [20]">
##FILTER=<ID=MinAB,Description="Minimum number of alternate bases (INFO/DP4) [5]">
##FILTER=<ID=Qual,Description="Minimum value of the QUAL field [10]">
##FILTER=<ID=VDB,Description="Minimum Variant Distance Bias (INFO/VDB) [0]">
##FILTER=<ID=GapWin,Description="Window size for filtering adjacent gaps [3]">
##FILTER=<ID=MapQualBias,Description="Min P-value for mapQ bias (INFO/PV4) [0]">
##FILTER=<ID=SnpGap,Description="SNP within INT bp around a gap to be filtered [2]">
##FILTER=<ID=RefN,Description="Reference base is N []">
##FILTER=<ID=MinDP,Description="Minimum read depth (INFO/DP or INFO/DP4) [5]">
##FILTER=<ID=Het,Description="Genotype call is heterozygous (low quality) []">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred-scaled Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Phred-scaled genotype posterior probabilities">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##FORMAT=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="Number of high-quality non-reference bases">
##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-fwd, ref-reverse, alt-fwd and alt-reverse bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##FORMAT=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="Whether a sample was a Pass(1) or fail (0) based on FILTER values">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Total Number of high-quality ref-fwd, ref-reverse, alt-fwd and alt-reverse bases">
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type from Ensembl 78 as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino
_acids|Codons|Existing_variation|DISTANCE|STRAND">
##contig=<ID=1,length=195471971>
##contig=<ID=10,length=130694993>
##contig=<ID=11,length=122082543>
##contig=<ID=12,length=120129022>
##contig=<ID=13,length=120421639>
##contig=<ID=14,length=124902244>
##contig=<ID=15,length=104043685>
##contig=<ID=16,length=98207768>
##contig=<ID=17,length=94987271>
##contig=<ID=18,length=90702639>
##contig=<ID=19,length=61431566>
##contig=<ID=2,length=182113224>
##contig=<ID=3,length=160039680>
##contig=<ID=4,length=156508116>
##contig=<ID=5,length=151834684>
##contig=<ID=6,length=149736546>
##contig=<ID=7,length=145441459>
##contig=<ID=8,length=129401213>
##contig=<ID=9,length=124595110>
##contig=<ID=MT,length=16299>
##contig=<ID=X,length=171031299>
##contig=<ID=Y,length=91744698>
##ALT=<ID=X,Description="Represents allele(s) other than observed.">
##samtoolsVersion=1.1+htslib-1.1
##bcftools_callVersion=1.1+htslib-1.1
##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa
##source_20141009.1=vcf-annotate(r953) -f +/D=400/d=5/q=20/w=2/a=5/ (BTBR_T+_Itpr3tf_J,ST_bJ)
##QUAL=<ID=QUAL,Number=1,Type=Float,Description="The highest QUAL value for a variant location from any of the samples.">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  C57BL_6JxCAST_EiJ
10      3101242 rs387754637     T       TAA     94      PASS    .       GT      0|1
10      3101362 rs33880920      G       C       228     PASS    .       GT      0|1
10      3102112 rs29321579      T       C       228     PASS    .       GT      0|1
10      3102661 rs29370426      C       G       228     PASS    .       GT      0|1
10      3103479 rs108493055     A       G       228     PASS    .       GT      0|1
10      3103831 rs33849371      T       C       228     PASS    .       GT      0|1
10      3103836 rs584544245     T       C       227     PASS    .       GT      0|1
10      3103895 rs29364631      T       C       228     PASS    .       GT      0|1
10      3103925 rs578285186     T       C       228     PASS    .       GT      0|1
10      3104351 rs387537244     A       AC      39.4943 PASS    .       GT      0|1
10      3104745 rs241787200     A       G       228     PASS    .       GT      0|1
10      3104903 rs29350251      A       G       228     PASS    .       GT      0|1
10      3104909 rs580341507     G       A       228     PASS    .       GT      0|1
10      3105426 rs108035459     T       C       228     PASS    .       GT      0|1
10      3105799 rs234511968     A       C       201     PASS    .       GT      0|1
10      3105800 rs217261475     A       T       190     PASS    .       GT      0|1
10      3105801 rs245173074     C       T       217     PASS    .       GT      0|1
10      3105856 rs579919330     A       G       228     PASS    .       GT      0|1
10      3105858 rs583575549     A       G       228     PASS    .       GT      0|1
10      3106780 rs581394925     A       G       228     PASS    .       GT      0|1
10      3106917 rs256461417     C       T       195     PASS    .       GT      0|1
10      3107123 rs223250688     A       G       228     PASS    .       GT      0|1
10      3107640 rs29345216      T       A       228     PASS    .       GT      0|1
10      3107674 rs240873984     A       G       228     PASS    .       GT      0|1
10      3107749 rs256303459     C       T       228     PASS    .       GT      0|1
10      3107904 rs246030398     A       G       228     PASS    .       GT      0|1
10      3107917 rs233488944     A       G       228     PASS    .       GT      0|1
10      3107937 rs262503241     T       C       228     PASS    .       GT      0|1
10      3108077 rs218305584     A       C       228     PASS    .       GT      0|1
10      3108112 rs29378080      T       C       228     PASS    .       GT      0|1
10      3108231 rs29346229      C       T       228     PASS    .       GT      0|1
10      3108450 rs247930443     T       C       228     PASS    .       GT      0|1
10      3108510 rs387013722     A       AACAGACACACAC   115     PASS    .       GT      0|1
10      3108855 rs107696935     G       C       228     PASS    .       GT      0|1
10      3108985 rs223682752     A       T       228     PASS    .       GT      0|1
10      3109160 rs578781673     T       C       195     PASS    .       GT      0|1
10      3109188 rs587296554     G       C       199     PASS    .       GT      0|1
10      3109189 rs583582949     A       C       201     PASS    .       GT      0|1
10      3116973 rs587319011     C       A       76      PASS    .       GT      0|1
10      3117386 rs49605811      T       C       228     PASS    .       GT      0|1
10      3117445 rs266123009     C       CCA     228     PASS    .       GT      0|1
10      3117578 rs48009659      C       T       228     PASS    .       GT      0|1
10      3118041 rs262864509     T       A       228     PASS    .       GT      0|1
10      3118135 rs387629506     G       GT      228     PASS    .       GT      0|1
10      3118156 rs29328552      G       A       228     PASS    .       GT      0|1
10      3118664 rs235884734     C       A       228     PASS    .       GT      0|1

By the time the error is raised 4.5 MB have been written to haplotypes.h5.

Does anyone know what that error means and what to do about it? Any help will be much appreciated.

Thanks,

mce

find_intersecting_snps.py encounters AttributeError when operating on paired-end reads

I've run into a puzzling error when applying find_intersecting_snps.py to some paired-end data. The script produces this output:

SAMPLES: ['sample1']
command line: /lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py --is_sorted --output_dir /data/t2d-progenitor/ATAC-seq/PE_ATAC --snp_tab /data/t2d-progenitor/h5/snp_tab.h5 --snp_index /data/t2d-progenitor/h5/snp_index.h5 --haplotype /data/t2d-progenitor/h5/haplotypes.h5 --samples sample1 /data/t2d-progenitor/ATAC-seq/PE_ATAC/PE_ATAC.sort.bam --is_paired_end
prefix: /data/t2d-progenitor/ATAC-seq/PE_ATAC/PE_ATAC.sort
reading reads from:
  /data/t2d-progenitor/ATAC-seq/PE_ATAC/PE_ATAC.sort.bam
writing output files to:
  /data/t2d-progenitor/ATAC-seq/PE_ATAC/PE_ATAC.sort.remap.fq1.gz
  /data/t2d-progenitor/ATAC-seq/PE_ATAC/PE_ATAC.sort.remap.fq2.gz
  /data/t2d-progenitor/ATAC-seq/PE_ATAC/PE_ATAC.sort.remap.single.fq.gz
  /data/t2d-progenitor/ATAC-seq/PE_ATAC/PE_ATAC.sort.keep.bam
  /data/t2d-progenitor/ATAC-seq/PE_ATAC/PE_ATAC.sort.to.remap.bam
starting chromosome chr1
reading SNPs from file '/data/t2d-progenitor/h5/snp_tab.h5'
reducing 344856 SNPs on chromosome chr1 to 214199 positions that are polymorphic in sample of 1 individuals
processing reads
Traceback (most recent call last):
  File "/lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py", line 930, in <module>
    samples=samples)
  File "/lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py", line 909, in main
    samples=samples)
  File "/lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py", line 622, in filter_reads
    if read.next_reference_name is None:
AttributeError: 'pysam.calignmentfile.AlignedSegment' object has no attribute 'next_reference_name'
Closing remaining open files:/data/t2d-progenitor/h5/snp_index.h5...done/data/t2d-progenitor/h5/snp_tab.h5...done/data/t2d-progenitor/h5/haplotypes.h5...done

It seems some expected attribute of the data is not there? Could this be caused by missing information in my input BAM file?

mapping problem for processing GEUVADIS data

Hello, I tried to play with WASP and run it on one of the GEUVADIS sample using python find_intersecting_snps.py . It finished sorting the bam files but stuck when it says "Starting on chromosome 1". The error message is the following:

Can you please help me take a look? Thank you very much!

Traceback (most recent call last):
File "find_intersecting_snps.py", line 456, in
main()
File "find_intersecting_snps.py", line 449, in main
bam_data.empty_slot_paired()
File "find_intersecting_snps.py", line 260, in empty_slot_paired
seq1s=self.check_for_snps(read,0)
File "find_intersecting_snps.py", line 299, in check_for_snps
for cigar in read.cigar:
TypeError: 'NoneType' object is not iterable

Paired end reads are not supported in rmdup.py

Dear Sir,

The script rmdup.py only supports single end reads, while this is not stated anywhere. It may nice to have this functionality implemented, or at least a warning about this in the documentation.

Thank you for your time,
Adriaan

Array cannot be safely cast to required type

Hi Graham,

I am trying to run the wasp pipeline on a sorted bam file I have, and am getting the error

TypeError: array cannot be safely cast to required type

The last line of output is

reading SNPs from file 'SNPs/chr2.snps.txt.gz'

The file chr2.snps.txt.gz only has entry in it, if that has any importance.

I pulled the latest version of wasp yesterday, so it should be up to date.
The full options I am running are

python find_intersecting_snps.py --is_sorted [input-bam] --snp_dir [dir]--output_dir ./

Sum of reads breakdown not equal to total reads

I checked the output log file for find_instersecting_snps. But when I noticed that the sum of indels, excess overlapping SNPs, keep, remap, and excess allelic combinations is larger than the total number of reads in the bam file. Any ideas why this happen?

Mapping Step 2

The output files are empty but I receive no error message or anything. I have no idea why my accepted_hits.quality.remap.num.gz and remap.fq.gz files are empty

Okay to include PEER covariants?

Hi Graham,

The final combined_test.py can take principle components into account. Would it be okay to use PEER factors, age, gender and other covariates?

Bosh

StopIteration exception in filter_remapped_read.py

Hi there,

For some reason, I'm getting a StopIteration exception at line 137 of filter_remapped_read.py. The iterator is reaching the end of the file without the script expecting it (I checked and the last read really is reached). If I handle the exception and set end_of_file to true (like appears in lots of other places in the script), then the code works fine (I think--at least it outputs results that seem reasonable). Is there any reason that the exception wasn't handled in this case (and I've done something wrong or my files are off...), or is this just a case that didn't come up in testing maybe? Thanks so much for any help, traceback is pasted below!

Traceback (most recent call last):
File "/home/noahrose/bin/WASP/mapping/filter_remapped_reads.py", line 182, in
main()
File "/home/noahrose/bin/WASP/mapping/filter_remapped_reads.py", line 179, in main
options.orig_num_file, options.is_paired_end)
File "/home/noahrose/bin/WASP/mapping/filter_remapped_reads.py", line 137, in run
orig_read = to_remap_bam.next()
File "pysam/calignmentfile.pyx", line 1561, in pysam.calignmentfile.AlignmentFile.next (pysam/calignmentfile.c:16453)
StopIteration

Failed to parse genotype likelihoods

Hi,

When I do this step:
./snp2h5/snp2h5 --chrom data/ucsc/hg19/chromInfo.txt.gz
--format vcf
--geno_prob geno_probs.h5
1000G/supporting/genotype_likelihoods/shapeit2/ALL.chr*.gl.vcf.gz

An error occurres:
ERROR: vcf.c:311: failed to parse genotype likelihoods from string '0,-2.96423' (line: 90250)

This error occurs on the chromosome X (and Y), the GLs of some SNPs in male samples are only two figures(such as '0,-2.96423'). May I ask how do you solve this problem? Can I delete this data?

Thank you so much!

find_intersecting_snps.py encounters IndexError when checking read for ref or alt allele

When running find_intersecting_snps.py, I got the following output:

SAMPLES: ['sample1']
command line: /lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py --is_sorted --output_dir /data/t2d-progenitor/CHIP-seq/PE_HNF6 --snp_tab /data/t2d-progenitor/h5/snp_tab.h5 --snp_index /data/t2d-progenitor/h5/snp_index.h5 --haplotype /data/t2d-progenitor/h5/haplotypes.h5 --samples sample1 /data/t2d-progenitor/CHIP-seq/PE_HNF6/PE_HNF6.sort.bam
prefix: /data/t2d-progenitor/CHIP-seq/PE_HNF6/PE_HNF6.sort
reading reads from:
  /data/t2d-progenitor/CHIP-seq/PE_HNF6/PE_HNF6.sort.bam
writing output files to:
  /data/t2d-progenitor/CHIP-seq/PE_HNF6/PE_HNF6.sort.remap.fq.gz
  /data/t2d-progenitor/CHIP-seq/PE_HNF6/PE_HNF6.sort.keep.bam
  /data/t2d-progenitor/CHIP-seq/PE_HNF6/PE_HNF6.sort.to.remap.bam
starting chromosome chrM
reading SNPs from file '/data/t2d-progenitor/h5/snp_tab.h5'
WARNING: chromosome chrM is not in snp_tab.h5 file, assuming no SNPs for this chromosome
processing reads
starting chromosome chr1
reading SNPs from file '/data/t2d-progenitor/h5/snp_tab.h5'
reducing 344856 SNPs on chromosome chr1 to 214199 positions that are polymorphic in sample of 1 individuals
processing reads
Traceback (most recent call last):
  File "/lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py", line 930, in <module>
    samples=samples)
  File "/lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py", line 909, in main
    samples=samples)
  File "/lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py", line 666, in filter_reads
    max_seqs, max_snps)
  File "/lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py", line 801, in process_single_read
    snp_read_pos)
  File "/lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py", line 391, in count_ref_alt_matches
    if ref_alleles[i] == read.query[read_pos[i]-1]:
IndexError: string index out of range
Closing remaining open files:/data/t2d-progenitor/h5/snp_index.h5...done/data/t2d-progenitor/h5/snp_tab.h5...done/data/t2d-progenitor/h5/haplotypes.h5...done

Deducing that the problem occurred in the count_ref_alt_matches function, I altered it to print some more information:

def count_ref_alt_matches(read, read_stats, snp_tab, snp_idx, read_pos):
    ref_alleles = snp_tab.snp_allele1[snp_idx]
    alt_alleles = snp_tab.snp_allele2[snp_idx]

    for i in range(len(snp_idx)):
        print('read.query: ' + read.query)
        print('len(read.query): ' + str(len(read.query)))
        print('read_pos[i]: ' +  str(read_pos[i]))
        if ref_alleles[i] == read.query[read_pos[i]-1]:
            # read matches reference allele
            read_stats.ref_count += 1
        elif alt_alleles[i] == read.query[read_pos[i]-1]:
            # read matches non-reference allele
            read_stats.alt_count += 1
        else:
            # read matches neither ref nor other
            read_stats.other_count += 1

This of course printed many lines as it iterated over reads, but the last few were:

read.query: ACACAACTTGCGGAATGCAATAAAATTATTTTGCCACCTCTCCCTTTAAA
len(read.query): 50
read_pos[i]: 28
read.query: CCCCTGGCTTCCACATGCTGAGTTTTCAGTCTCCCATCTCAAAGGCCCCA
len(read.query): 50
read_pos[i]: 13
read.query: CATAAAAAAAAAACAAAAAAAGACTGAAAAATATAGGTAT
len(read.query): 40
read_pos[i]: 41
Traceback (most recent call last):
  File "/lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py", line 933, in <module>
    samples=samples)
  File "/lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py", line 912, in main
    samples=samples)
  File "/lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py", line 669, in filter_reads
    max_seqs, max_snps)
  File "/lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py", line 804, in process_single_read
    snp_read_pos)
  File "/lab/aaylward-shared/WASP/mapping/find_intersecting_snps.py", line 394, in count_ref_alt_matches
    if ref_alleles[i] == read.query[read_pos[i]-1]:
IndexError: string index out of range
Closing remaining open files:/data/t2d-progenitor/h5/snp_index.h5...done/data/t2d-progenitor/h5/snp_tab.h5...done/data/t2d-progenitor/h5/haplotypes.h5...done

It seems the error occurs because the distance from the beginning of the read to the variant is longer than the total length of the read. I suspect this is not an issue with my data, because under version 0.1 find_intersecting_snps.py handled this input file without errors. I'm curious about how you think this situation should be handled.

Error in bam2h5.py

Hello Bryce and Graham,

I am using the CHT to map eQTLs. Step 1 and 4 was completed without any problem. But the last bit of step 5 gives the following error. I could not pinpoint the origin of this error so was wondering if you could give me a pointer. Would be happy to provide example data if needed. Thanks!

Begining error message

setting statistics for each chromosome
Traceback (most recent call last):
File "/srv/persistent/bliu2/tools/WASP/CHT/bam2h5.py", line 696, in
main()
File "/srv/persistent/bliu2/tools/WASP/CHT/bam2h5.py", line 684, in main
chromstat.set_stats(h5f, chrom_list)
File "/srv/persistent/bliu2/tools/WASP/CHT/chromstat.py", line 100, in set_stats
chrom_stat.set_from_vals(node[:])
File "/srv/persistent/bliu2/tools/anaconda/lib/python2.7/site-packages/tables/array.py", line 648, in getitem
arr = self._read_slice(startl, stopl, stepl, shape)
File "/srv/persistent/bliu2/tools/anaconda/lib/python2.7/site-packages/tables/array.py", line 754, in _read_slice
self._g_read_slice(startl, stopl, stepl, nparr)
File "hdf5extension.pyx", line 1491, in tables.hdf5extension.Array._g_read_slice (tables/hdf5extension.c:14091)
tables.exceptions.HDF5ExtError: HDF5 error back trace

File "H5Dio.c", line 173, in H5Dread
can't read data
File "H5Dio.c", line 550, in H5D__read
can't read data
File "H5Dchunk.c", line 1872, in H5D__chunk_read
unable to read raw data chunk
File "H5Dchunk.c", line 2902, in H5D__chunk_lock
data pipeline read failed
File "H5Z.c", line 1382, in H5Z_pipeline
filter returned failure during read
File "H5Zdeflate.c", line 125, in H5Z_filter_deflate
inflate() failed

End of HDF5 error back trace

Problems reading the array data.

End of error message

Bosh

Multiple entries in remap.fq*.gz for single read pair

I've used the mapping part of WASP successfully before, but for some reason I've started seeing what seems to be a bug where the sequence for a single read pair is written to the remap.fq*.gz files multiple times:

$ zcat test.remap.fq1.gz 
@1:chr12:6643710:6643710:3
TCCGATCTGCCGCATCTTCTTTTGCGTCGCCAGCCGAGCCACATCGCTCAGACACCATGGGGAAGGTGAAGGTCGGAGTCAACGGATTTGGTCGTATTGG
+1:chr12:6643710:6643710:3
FFB/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB
@1:chr12:6643710:6643710:3
TCCGATCTGCCGCATCTTCTTTTGCGTCGCCAGCCGAGCCACATCGCTGAGACACCATGGGGAAGGTGAAGGTCGGAGTCAACGGATTTGGTCGTATTGG
+1:chr12:6643710:6643710:3
FFB/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB
@1:chr12:6643710:6643710:3
TCCGATCTGCCGCATCTTCTTTTGCGTCGCCAGCCGAGCCACATCGCTGAGACACCATGGGGAAGGTGAAGGTCGGAGTCAACGGATTTGGTCGTATTGG
+1:chr12:6643710:6643710:3
FFB/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB

I've made a zip with files to reproduce the bug. I used the latest version of find_intersecting_snps.py from master.

https://dl.dropboxusercontent.com/u/3886457/wasp_bug.zip

error in bam2h5 set_stats

Hey guys, I'm having an issue with the latest master release ( f6e1d55 ) running bam2h5. After it goes through all of the chromosomes I get the following error:

setting statistics for each chromosome
Traceback (most recent call last):
  File "WASP/WASP-master/CHT/bam2h5.py", line 696, in <module>
    main()
  File "WASP/WASP-master/CHT/bam2h5.py", line 684, in main
    chromstat.set_stats(h5f, chrom_list)
  File "WASP/WASP-master/CHT/chromstat.py", line 100, in set_stats
    chrom_stat.set_from_vals(node[:])
  File "/home/ag284/.local/lib/python2.7/site-packages/tables/array.py", line 659, in __getitem__
    arr = self._read_slice(startl, stopl, stepl, shape)
  File "/home/ag284/.local/lib/python2.7/site-packages/tables/array.py", line 765, in _read_slice
    self._g_read_slice(startl, stopl, stepl, nparr)
  File "tables/hdf5extension.pyx", line 1522, in tables.hdf5extension.Array._g_read_slice (tables/hdf5extension.c:16883)
tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "H5Dio.c", line 173, in H5Dread
    can't read data
  File "H5Dio.c", line 551, in H5D__read
    can't read data
  File "H5Dchunk.c", line 1875, in H5D__chunk_read
    unable to read raw data chunk
  File "H5Dchunk.c", line 2905, in H5D__chunk_lock
    data pipeline read failed
  File "H5Z.c", line 1372, in H5Z_pipeline
    filter returned failure during read
  File "H5Zdeflate.c", line 125, in H5Z_filter_deflate
    inflate() failed

End of HDF5 error back trace

Problems reading the array data.
Closing remaining open files

This is what I'm calling, and the BAM file was processed using the WASP re-mapping pipeline:

python $WASP_DIR/CHT/bam2h5.py \
--chrom $WASP_DIR/example_data/chromInfo.hg19.txt  \
--snp_index $CUR.snp_index.h5 \
--snp_tab   $CUR.snp_tab.h5 \
--haplotype $CUR.haps.h5 \
--samples IND \
--individual $IND \
--ref_as_counts $OUT.ref_as_counts.h5 \
--alt_as_counts $OUT.alt_as_counts.h5 \
--other_as_counts $OUT.other_as_counts.h5 \
--read_counts $OUT.read_counts.h5 \
$BAM

I do have some random contigs in the bam (chr17_ctg5_hap1, chrUn_gl000249, etc) and I get some CIGAR I/D errors while bam2h5 is running, could that be the issue?

Thanks for your help!

Chimeric reads and filter_remapped_reads.py

I've been putting some .bam files through the WASP mapping pipeline and I have hit into a problem. When I put my remapped reads through filter_remapped_reads.py, I get the following error:

ValueError: saw read NS500144:884:H5CMVBGX3:3:13602:8091:6268 more times than expected in input file

It suggests that the original .bam used paired-end reads, so I re-ran the original .bam through find_intersecting_snps.py with --is_paired_end. The resting output gave two very small paired .fastqs and a single fastq file that is as large as the fastq file created when --is_paired_end was in its default single state.

After remapping using BWA mem, the .bam file from the paired ends gave the following error when put through filter_remapped_reads.py:

keep_reads: 0 bad_reads: 0 discard_reads: 1197916

And using the single fastq gave the same ValueError as before. After probing the .bam files, we did find repeated IDs:

NS500144:884:H5CMVBGX3:3:13602:8091:6268 2064 chr11 119889737 59 35M40H * 0 0 CTTTGACGCGTATGACAAATCCGTGAGCTCTTGGA E/EEEEEEEAAEEEEEEEEEEEE/EEEEEEE/EEE NM:i:1 MD:Z:21T13 AS:i:30 XS:i:0 SA:Z:chr11,119889741,-,33S42M,60,1; NS500144:884:H5CMVBGX3:3:13602:8091:6268 16 chr11 119889741 60 33S42M * 0 0 CTTTGACGCGTATGACAAATCCGTGAGCTCTTGGACGCGTATGACAAATCCGTGAGCTCTTGGAGTAGAAGATAA E/EEEEEEEAAEEEEEEEEEEEE/EEEEEEE/EEEEEEE6EEEEEEEEEEEE/EEEEEEEEEEEEEEEEEAAAAA NM:i:1 MD:Z:17T24 AS:i:37 XS:i:0 SA:Z:chr11,119889737,-,35M40S,59,1;

We suspect that this fault is due to a chimeric read. Is this the case, and is there a way to accommodate this in the WASP pipeline?

Thanks,

Anamay

Memory error

Hi Graham,

I'm trying to run the find_intersecting_snps.py script but it reserves unto 50G of memory for one BAM. Is this normal? Also the --samples are those the samples in my bam file or samples in the 1000G file used to create the SNP files?

Segfault running snp2h5

Hi,
I'm getting a segfault and a core dump when I run snp2h5. It looks to me like there may be a bug with how the pairs of hap andimpute2 files are associated. The first chromosome that is parsed runs correctly (e.g., chr10), but the second produces an error. I think snp2h5 is not updating to the correct hap file when it cycles to the next chromosome...

Let me know if there is anything I can do to help.

-- Nick

~/data/external_datasets/cbrown_impute2$ ~/source/WASP/snp2h5/bin/./snp2h5 --chrom /home/ngcrawford/data/genomes/chrmInfo.trunc.txt --format impute --geno_prob geno_probs.h5 --snp_index snp_index.h5 --snp_tab snp_tab.h5 --haplotype haps.h5 5M.hg19.RNAseq.d3_unrelated.filtered.o.chr_.impute2.gz 5M.hg19.RNAseq.d3_unrelated.filtered.o.chr_.impute2_haps.gz
writing genotype probabilities to: geno_probs.h5
writing haplotypes to: haps.h5
writing SNP index to: snp_index.h5
writing SNP table to: snp_tab.h5
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr10.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr11.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr12.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr13.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr14.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr15.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr16.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr17.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr18.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr19.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr1.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr20.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr21.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr22.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr2.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr3.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr4.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr5.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr6.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr7.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr8.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr9.impute2.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr10.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr11.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr12.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr13.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr14.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr15.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr16.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr17.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr18.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr19.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr1.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr20.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr21.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr22.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr2.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr3.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr4.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr5.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr6.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr7.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr8.impute2_haps.gz'
'5M.hg19.RNAseq.d3_unrelated.filtered.o.chr9.impute2_haps.gz'
chromosome: chr10, length: 135534747bp
reading from file 5M.hg19.RNAseq.d3_unrelated.filtered.o.chr10.impute2.gz
counting lines in file
total lines: 1461156
number of samples: 162
initializing HDF5 matrix with dimension: (1461156, 486)
parsing file and writing to HDF5 files
.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
initializing HDF5 matrix with dimension: (1461156, 324)
reading from file 5M.hg19.RNAseq.d3_unrelated.filtered.o.chr10.impute2_haps.gz
parsing file and writing to HDF5 files
.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
chromosome: chr11, length: 135006516bp
reading from file 5M.hg19.RNAseq.d3_unrelated.filtered.o.chr11.impute2.gz
counting lines in file
total lines: 1640404
number of samples: 162
initializing HDF5 matrix with dimension: (1640404, 486)
parsing file and writing to HDF5 files
........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
initializing HDF5 matrix with dimension: (1640404, 324)
reading from file 5M.hg19.RNAseq.d3_unrelated.filtered.o.chr10.impute2_haps.gz
parsing file and writing to HDF5 files
WARNING: snp2h5.c:775: snp at position 20000105 present in impute2_haps file but not in impute2 file
WARNING: snp2h5.c:775: snp at position 20000144 present in impute2_haps file but not in impute2 file
WARNING: snp2h5.c:775: snp at position 20000279 present in impute2_haps file but not in impute2 file

[... A LARGE NUMBER OF LINES ...]

WARNING: snp2h5.c:775: snp at position 135006472 present in impute2_haps file but not in impute2 file
ERROR: Segmentation fault (core dumped)

Step 3 Remapping Issue with remap.fq.gz files

Step two finished with no errors except my remap.fq.gz files appear to be filled with ~~~~~ and I'm not sure why this is. I tried moving on to step 3 in remapping but receive this error: IndexError: string index out of range. There is nothing wrong with TopHat, the genome, or the annotation files. The error appears to be something wrong with my fq.gz files that were generated during step 2. Any suggestions or thoughts would be greatly appreciated. thank you

Here is my command:
./tophat -o /Users/olneykimberly/Desktop/TopHat/tophat-2.0.13.OSX_x86_64/hg18_out --no-coverage-search -p 6 -G /Users/olneykimberly/Desktop/Homo_sapiens_UCSC_hg18/Homo_sapiens/UCSC/hg18/Annotation/Archives/archive-2014-06-02-13-47-56/Genes/genes.gtf /Users/olneykimberly/Desktop/Homo_sapiens_UCSC_hg18/Homo_sapiens/UCSC/hg18/Sequence/Bowtie2Index/genome /Users/olneykimberly/Desktop/TopHat/tophat-2.0.13.OSX_x86_64/hg18_out/accepted_hits.remap.fq1.gz,/Users/olneykimberly/Desktop/TopHat/tophat-2.0.13.OSX_x86_64/hg18_out/accepted_hits.remap.fq2.gz

[2015-04-03 10:05:45] Beginning TopHat run (v2.0.13)

[2015-04-03 10:05:45] Checking for Bowtie
Bowtie version: 2.2.4.0
[2015-04-03 10:05:45] Checking for Bowtie index files (genome)..
[2015-04-03 10:05:45] Checking for reference FASTA file
[2015-04-03 10:05:45] Generating SAM header for /Users/olneykimberly/Desktop/Homo_sapiens_UCSC_hg18/Homo_sapiens/UCSC/hg18/Sequence/Bowtie2Index/genome
Traceback (most recent call last):
File "./tophat", line 4088, in
sys.exit(main())
File "./tophat", line 3942, in main
params.read_params = check_reads_format(params, reads_list)
File "./tophat", line 1840, in check_reads_format
freader=FastxReader(zf.file, params.read_params.color, zf.fname)
File "./tophat", line 1585, in init
while hlines>0 and self.lastline[0] not in "@>" :
IndexError: string index out of range

'AttributeError' in filter_remapped_reads.py

Hello, it appears that there's a mismatch in the filter_remapped_reads.py main() function, where you add the argument "to_remap_bam", but when passing it to the run() function it is referred to as it's older name, "options.orig_bam". This causes an AttributeError when running:

python filter_remapped_reads.py -p DP9-HT10_quality.to.remap.bam DP9-HT10_quality_mapw.bam DP9-HT10_quality.remap.keep.bam DP9-HT10_quality.to.remap.num.gz
Traceback (most recent call last):
  File "filter_remapped_reads.py", line 158, in <module>
    main()
  File "filter_remapped_reads.py", line 154, in main
    run(options.orig_bam, options.remap_bam, options.keep_bam,
AttributeError: 'Namespace' object has no attribute 'orig_bam'

It looks like this issue was introduced in commit f850b49d1dfb81014e46899685a0f660af71bee0, when you refactored the code. Thank you!

reads with flipped alleles

Hi,
I have read some information that flipped alleles denote A/T or G/C SNPs, but I don't know why only these reads need to remap. Could you please give me illustration more clearly?

Thank you very much!

AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'next_reference_name'

Hi Graham,

I got this error when running rmdup_pe.py. My pysam distribution is 0.9.1.4 (latest).

This error is specific to v0.2.1 as v0.1 can run without problem.

Bosh

Step 3: Remapping Issue with remap.fq.gz files using GEM

Hi ,
I used remap.fq1.gz files and used remap.fa2.gz to do step 3. However, it seemed that the fq.gz was not correct. The error information was following:

Second pair input file found: reads_2.fastq.gz

ERROR: Failed to initialize neccessary parameters:

Unable to get pairing information from input.
Please check your read id's and make sure its either in casava >= 1.8 format or the
ids end with /1 and /2

Can you give me some suggestion?
Thanks.

get_target_regions.py "Index out of range" error

Hi Graham,
I am having issues with get_target_regions.py script and one sample (the last one in our sample file). Everything works fine for 24 of our 25 samples and then I get this:

individual: lgs100524
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
individual: lgs302096
chr1
Traceback (most recent call last):
File "/usr/local/analysis/wasp/0.2.1-a48f02c/bin/get_target_regions.py", line 495, in
main()
File "/usr/local/analysis/wasp/0.2.1-a48f02c/bin/get_target_regions.py", line 396, in main
combined_files.add_counts(chrom_list, count_files, snp_files, ind_idx)
File "/usr/local/analysis/wasp/0.2.1-a48f02c/bin/get_target_regions.py", line 132, in add_counts
a_hap = hap_tab[:, hap_a_idx]
File "/usr/lib/python2.7/dist-packages/tables/array.py", line 647, in getitem
startl, stopl, stepl, shape = self._interpret_indexing(key)
File "/usr/lib/python2.7/dist-packages/tables/array.py", line 399, in _interpret_indexing
raise IndexError("Index out of range")
IndexError: Index out of range
Closing remaining open files:./combined_read_count.h5...done../../HDF5_formatted_files/snp_tab.h5...done./combined_ref_count.h5...done../../Bam2h5_files/H3K4me1//alt_as_counts.lgs302096.h5...done../../HDF5_formatted_files/haplotypes.h5...done./combined_alt_count.h5...done../../HDF5_formatted_files/snp_index.h5...done./combined_het_count.h5...done../../Bam2h5_files/H3K4me1//ref_as_counts.lgs302096.h5...done../../Bam2h5_files/H3K4me1//read_counts.lgs302096.h5...done./combined_as_count.h5...done

This sample (302096) doesn't produce any errors in any of the mapping routines. So, I thought maybe the problem was with the HDF5 SNP files (specifically the haplotype file from the error). However, when I look at the vcf formatted genotype and haplotype files that were used to create the HDF5 SNP files, there doesn't seem to be any issue (the sample is present in the genotype and haplotype files and there seems to be sufficient information present).

Do you have any thoughts? What additional info can I send to you that might help"?

Thanks so much!

Intersecting SNPs

I am seeing lot of warnings that says "SNP out of order has been skipped" while using "find_intersecting_snps.py". Can you please explain what the warning means and is it okay to ignore?

Thank you

Reads with soft clipping

Hi,

Thank you for making this great software available! I encountered what appears to be an issue with handling soft clipped reads that I wanted to bring to your attention. Below is an example of a read that causes this issue. Note that the cigar string is 23S77M. It is these type of reads, where the alignment begins with a soft clip, which seem be the problem.

HWI-ST1113:347:H99VNADXX:1:1207:16477:32946 4 chr10 75531 0 23S77M * 0 0 AAATGTGTAGCAGTAGGAAAGCTCCCTTGAGAACTGGCACAAGACAAGGATGGCCTCTCTCACCACTCCTATTCAACGTAGTACTGGAATTTCTGGCCAG CCCFFBDDHHHHHGHIJIIJJJJJJJJJJIJJJIJJJJJJIJEIIJJJJJJJJJJJJIJJJIJJJJJJJJHHHHHHFFEFD?CEEEDDDDDDDECCCDDD NM:i:0 MD:Z:77 AS:i:77 XS:i:77

The SNP involved here is: 75552 C A
It is on chromosome 10 (in the file “chr10.snps.txt.gz”)

find_intersecting_snps.py correctly identifies this read as overlapping the SNP, and writes it *.to.remap.bam. However, in *.remap.fq.gz, the wrong position is flipped. Here’s the corresponding line in *.remap.fq.gz:

@9:chr10:75530:1
AAATGTGTAGCAGTAGGAAAGATCCCTTGAGAACTGGCACAAGACAAGGATGGCCTCTCTCACCACTCCTATTCAACGTAGTACTGGAATTTCTGGCCAG
+9:chr10:75530:1
CCCFFBDDHHHHHGHIJIIJJJJJJJJJJIJJJIJJJJJJIJEIIJJJJJJJJJJJJIJJJIJJJJJJJJHHHHHHFFEFD?CEEEDDDDDDDECCCDDD

Comparing the original read to the flipped read we can see the flipped position (in brakets []):
AAATGTGTAGCAGTAGGAAAG[C]TCCCTTGAGAACTGGCACAAGACAAGGATGGCCTCTCTCACCACTCCTATTCAACGTAGTACTGGAATTTCTGGCCAG AAATGTGTAGCAGTAGGAAAG[A]TCCCTTGAGAACTGGCACAAGACAAGGATGGCCTCTCTCACCACTCCTATTCAACGTAGTACTGGAATTTCTGGCCAG

However, the SNP, and thus the position that should have been flipped, is actually here (in squiggly brackets {}):
AAATGTGTAGCAGTAGGAAAG[C]TCCCTTGAGAACTGGCACAAGA{C}AAGGATGGCCTCTCTCACCACTCCTATTCAACGTAGTACTGGAATTTCTGGCCAG AAATGTGTAGCAGTAGGAAAG[A]TCCCTTGAGAACTGGCACAAGA{C}AAGGATGGCCTCTCTCACCACTCCTATTCAACGTAGTACTGGAATTTCTGGCCAG

The flipped position is 22 bp from the start of the read. Notably, the SNP (and position that should have been flipped) is 22 bp from the start of the matched (i.e.“M”) portion of the read. The first 23 positions in the read are soft clipped (23S77M), and thus the SNP is actually 22+23 = 45 bp from the start of the read. It seems that the problem is: to arrive at the correct SNP position, the distance should be counted from start of the M portion of the read, whereas the script seems to count from the absolute beginning of the read.

It does seems that the script is designed to handle soft clipped reads, as line 313 is:
if cigar[0] in (0,4): #if CIGAR indicates a match alignment to the reference genome or a soft clipping

Any insights would be greatly appreciated.

Thank you!

Best,

Dave

The snp directory problem.

Hi,

I'm Jia, a postdoc from UNC Charlotte. I wonder to know in the directory <SNP_file_directory> in the Step 2 of Mapping. Does it include all snps from all chromosomes or just heterozygous snps from all chromosomes?

Thanks.

bmvdgeijn / wasp Goto Github PK

wasp's Issues

ImportError: No module named libchtslib

Begining error message

End of error message

[2015-04-03 10:05:45] Beginning TopHat run (v2.0.13)

Recommend Projects

Recommend Topics

Recommend Org