Comments (8)

gmcvicker commented on July 29, 2024

Hi Joost,

Can you try running again (at least on one sample) using the latest version of WASP, which I released on GitHub a few days ago? You can either pull from the master branch or use release tag v0.2.1. The new version fixes a couple of bugs related to paired-end reads, and find_intersecting_snps.py now outputs a report that summarizes why reads are being discarded. If you re-run find_intersecting_snps.py, could you then provide the output of this report? It should look like this (only with more counts):

DISCARD reads:
  improper pair: 0
  different chromosome: 0
  indel: 0
  secondary alignment: 0
  excess overlapping snps: 0
  excess allelic combinations: 0
  missing pairs (e.g. mismatched read names): 0
KEEP reads:
  single-end: 0
  pairs: 0
REMAP reads:
  single-end: 2
  pairs: 0

One possible issue is that if you have a large number of samples (i.e. a large number of SNPs) and long reads, the number of SNPs overlapping a single read can be very high. The new version allows you to use the haplotypes present in your samples rather than all possible allelic combinations of overlapping SNPs, which reduces the number of possible reads that need to be considered. To run the pipeline in this way, you will need to regenerate the SNP HDF5 files using the provided snp2h5 program, and then pass the --haplotype, --snp_index, and --snp_tab arguments to find_intersecting_snps.py.
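To see why haplotypes help here: enumerating all allelic combinations of N overlapping biallelic SNPs yields 2^N candidate reads, whereas restricting to haplotypes actually present in the samples caps the count at the number of distinct haplotypes observed. A minimal sketch of the idea (illustrative only, not WASP's actual code; all names are made up):

```python
from itertools import product

def all_allelic_combinations(snp_alleles):
    """Every combination of alleles across overlapping SNPs:
    2**N candidates for N biallelic SNPs."""
    return list(product(*snp_alleles))

def observed_haplotypes(sample_haplotypes):
    """Only the distinct allele strings actually carried by the samples."""
    return sorted(set(sample_haplotypes))

# A read overlapping 10 biallelic SNPs:
snp_alleles = [("A", "G")] * 10
print(len(all_allelic_combinations(snp_alleles)))   # 1024 candidate reads

# ...but the samples may carry only a few haplotypes:
haps = observed_haplotypes(
    ["AAAAAAAAAA", "GGGGGGGGGG", "AAAAAGGGGG", "AAAAAAAAAA", "GGGGGGGGGG"])
print(len(haps))                                    # 3 reads to remap
```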

Depending on how you plan to use the output from the WASP mapping step, you may also be able to retain more reads with the new version of WASP. You can do this by only using SNPs that are polymorphic in a given sample, via the --samples argument. For example, you could provide --samples GM12878 to tell find_intersecting_snps.py to only consider SNPs that are polymorphic in the sample GM12878. I think this approach will work well if you are just planning to use allele-specific reads. If you plan to use the full CHT test (with read depth + allele-specific reads), however, I am not sure this is a good idea, as I think it can add noise or bias to the read-depth portion of the test.

from wasp.

jverlouw commented on July 29, 2024

Hello Graham,

Thanks for mentioning the new version; unfortunately, I somehow missed the release entirely before I made the post.

However, I'm running into an issue with both the new HDF5 mode and the "old" text-based SNP list. See the error snippet below:

starting chromosome 1
reading SNPs from file '/path/to/snp_tab.h5'
processing reads
Traceback (most recent call last):
  File "/apps/software/WASP/0.2.1-Python-2.7.9/mapping/find_intersecting_snps.py", line 930, in <module>
    samples=samples)
  File "/apps/software/WASP/0.2.1-Python-2.7.9/mapping/find_intersecting_snps.py", line 909, in main
    samples=samples)
  File "/apps/software/WASP/0.2.1-Python-2.7.9/mapping/find_intersecting_snps.py", line 651, in filter_reads
    max_snps)
  File "/apps/software/WASP/0.2.1-Python-2.7.9/mapping/find_intersecting_snps.py", line 688, in process_paired_read
    indel_idx, indel_read_pos = snp_tab.get_overlapping_snps(read)
  File "/apps/software/WASP/0.2.1-Python-2.7.9/mapping/snptable.py", line 392, in get_overlapping_snps
    elif seq_type == BAM_CHARD_CLIP:
NameError: global name 'seq_type' is not defined

If I look into snptable.py, I see that 'op' is defined for all the CIGAR options, but the last two branches use 'seq_type' instead. Might this have something to do with it? I'm not well versed in Python, so I might be completely wrong, of course!
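Joost's diagnosis is essentially right: a loop variable renamed in most branches but not the last two produces exactly this NameError. A hypothetical miniature of the pattern and its fix (not the actual snptable.py code; op codes follow the SAM spec):

```python
# BAM CIGAR operation codes (from the SAM specification)
BAM_CMATCH, BAM_CINS, BAM_CDEL = 0, 1, 2
BAM_CHARD_CLIP, BAM_CPAD = 5, 6

def classify_cigar(cigartuples):
    """Walk (op, length) pairs; using one variable name ('op') in every
    branch avoids the NameError seen when the last branches refer to an
    undefined 'seq_type'."""
    kinds = []
    for op, length in cigartuples:
        if op == BAM_CMATCH:
            kinds.append("match")
        elif op in (BAM_CINS, BAM_CDEL):
            kinds.append("indel")
        elif op == BAM_CHARD_CLIP:   # buggy version: seq_type == BAM_CHARD_CLIP
            kinds.append("hard_clip")
        elif op == BAM_CPAD:
            kinds.append("pad")
        else:
            raise ValueError("unknown CIGAR op: %d" % op)
    return kinds

print(classify_cigar([(0, 50), (5, 10)]))  # ['match', 'hard_clip']
```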

Regards,
Joost

gmcvicker commented on July 29, 2024

Hi Joost,

It sounds like this is a bug with the new version of the code. I will look into it. Thanks for letting me know.

Graham

gmcvicker commented on July 29, 2024

Hi Joost,

I committed a fix for this to the head and development branches:
475f923

Please pull the latest from the head branch and let me know if you run into any further problems.

One concern I have is that this problem implies that some of your read alignments contain hard clipping, padding, or an unknown CIGAR code. I haven't tested mapped reads with hard clipping or padding, but in principle the code should work on them. What mapper are you using?
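One quick way to check whether a BAM contains such records is to scan the CIGAR strings for hard-clip (H) or pad (P) operations. A small sketch using plain string parsing (pysam reports the same operations as numeric codes 5 and 6; the function names here are made up):

```python
import re

def cigar_ops(cigar):
    """Split a CIGAR string like '30M46H' into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def has_hard_clip_or_pad(cigar):
    """True if any operation is a hard clip (H) or pad (P)."""
    return any(op in ("H", "P") for _, op in cigar_ops(cigar))

print(has_hard_clip_or_pad("76M"))      # False
print(has_hard_clip_or_pad("30M46H"))   # True: hard-clipped alignment
```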

Thanks,

Graham

jverlouw commented on July 29, 2024

Hello Graham,

The BAM files that I currently use in the analysis were aligned with HISAT. However, your mention of hard clipping made me think: the specific BAM files I have been using were also processed with GATK's SplitNTrim and indel realigner.
About one million of the reads give incorrect-coordinate errors, which I presume is caused by the realigner. That still does not account for a lot of the missing reads, however.
Might the hard clipping by SplitNTrim perhaps cause those reads to be discarded silently?

I'll test the remapping on a "fresh" BAM file over the weekend; I hope that will solve most of the missing reads.

Regards,
Joost

jverlouw commented on July 29, 2024

Hello Graham,

While trying to get the new HDF5 conversion to work I've run into a few problems.
First off, when running snp2h5 on our entire VCF file, the program hits a segmentation fault very quickly.

When running it per chromosome, however (and this is the main problem), it behaves as if it's trying to read chromosome 3 twice; see below:

chromosome: 2, length: 243199373bp
reading from file chr2.snps.r5.3.monoAllelicOnly.vcf
counting lines in file
total lines: 1667577
reading VCF header
VCF header lines: 104
number of samples: 499
initializing HDF5 matrix with dimension: (1667473, 998)
parsing file and writing to HDF5 files
....................

chromosome: 3, length: 198022430bp
reading from file chr3.snps.r5.3.monoAllelicOnly.vcf
counting lines in file
total lines: 1409701
reading VCF header
VCF header lines: 104
number of samples: 499
initializing HDF5 matrix with dimension: (1409597, 998)
parsing file and writing to HDF5 files
....................

chromosome: 3, length: 198022430bp
reading from file chr4.snps.r5.3.monoAllelicOnly.vcf
counting lines in file
total lines: 1404545
reading VCF header
VCF header lines: 104
number of samples: 499
initializing HDF5 matrix with dimension: (1404441, 998)
HDF5-DIAG: Error detected in HDF5 (1.8.5-patch1) thread 0:
  #000: ../../src/H5Ddeprec.c line 169 in H5Dcreate1(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: ../../src/H5Dint.c line 431 in H5D_create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: ../../src/H5L.c line 1640 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: ../../src/H5L.c line 1863 in H5L_create_real(): can't insert link
    major: Symbol table
    minor: Unable to insert object
  #004: ../../src/H5Gtraverse.c line 964 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: ../../src/H5Gtraverse.c line 759 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #006: ../../src/H5L.c line 1675 in H5L_link_cb(): name already exists
    major: Symbol table
    minor: Object already exists
ERROR: snp2h5.c:470 failed to create dataset

I noticed this when running the entire VCF as well; there it would start with "chromosome: 3, length: 198022430bp" first.

Might this have something to do with the chromInfo file? There are no duplicate chromosome 3 entries in there, though.
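Since snp2h5 expects one chromosome per input file, a quick sanity check is to confirm that each VCF really contains a single chromosome in its first column. A hypothetical helper (not part of WASP):

```python
def vcf_chromosomes(lines):
    """Return the set of chromosome names in the CHROM column of a VCF,
    skipping header lines that start with '#'."""
    return {line.split("\t", 1)[0]
            for line in lines
            if line and not line.startswith("#")}

vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "3\t1000\t.\tA\tG",
    "3\t2000\t.\tC\tT",
]
chroms = vcf_chromosomes(vcf)
print(chroms)  # {'3'}
assert len(chroms) == 1, "snp2h5 expects a single chromosome per file"
```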

Regards,
Joost

gmcvicker commented on July 29, 2024

Hi Joost,
I've been looking into this problem, but so far I have not been able to reproduce it with my test VCF files. How did you run the program with a combined VCF (e.g., what was the input filename)? The program expects a single file per chromosome and should not be able to run on files whose names do not contain the chromosome name.

Would it be possible for you to share with me the files (or a small version of them) that are causing the problem? The smaller the files, the better.

Thanks,

Graham

gmcvicker commented on July 29, 2024

I think the segfault may have been caused by a mistake in writing an error message indicating that a SNP's position is outside of the chromosome's range (this can happen if SNPs from several chromosomes are mixed in the same file, or if there is an assembly version mismatch). I have corrected this and committed the change to the master branch (422ef94). I am still not sure what is causing the repeated chr3 problem you are experiencing.
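The failure mode described here (a SNP position past the end of its chromosome, e.g. from mixed chromosomes or an assembly mismatch) can be guarded against with a simple bounds check before writing. An illustrative sketch, not snp2h5's actual code:

```python
def check_snp_positions(snps, chrom_lengths):
    """Flag SNPs whose 1-based position falls outside their chromosome,
    or whose chromosome is missing from the chromInfo table."""
    bad = []
    for chrom, pos in snps:
        length = chrom_lengths.get(chrom)
        if length is None or not (1 <= pos <= length):
            bad.append((chrom, pos))
    return bad

chrom_lengths = {"2": 243199373, "3": 198022430}
snps = [("3", 1500), ("3", 198022431), ("4", 100)]
print(check_snp_positions(snps, chrom_lengths))
# [('3', 198022431), ('4', 100)]
```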
