Code Monkey home page Code Monkey logo

bamqc-legacy's People

Contributors

iainrb avatar morgantaschuk avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

bamqc-legacy's Issues

onTarget improvements

Overlap Detection

https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L586-L592

The BED interval is defined as [chromStart, chromEnd):

The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

The end point is open, so calling start within the range should be $start < $intervals[$_]{Stop} and not $start <= $intervals[$_]{Stop}

Dead code

https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L559
Parameter does not appear in code

https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L594-L595
Never used

Unclear use of for loop

I don't understand what this chunk accomplishes
https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L597-L607

  • Why sort the idx variable?
  • The next if ($onTarget) ensures the loop modifies that stats hash only once, so the loop is redundant

runningBaseCoverage bug and improvement

Structure of $stats for reference

stats:
    runningBaseCoverage:
        chr1:
            # Position on chromosome
            1234567:
                # Fragment name $chrom\t$start1\t$start2
                chr1\t1234560\t1234580:    
                    # Final count of reads at this position
                    42

Wrong variable bug

https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L723-L724

Given the documentation, I think !$chr should be !chrom

In addition, the above if statement in the for loop does not match the delete if statement that determines if the key should be deleted.

https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L753-L755

Lastly, it might be best to move the chromosome checks out of the $pos loop and into the $chr loop. It would make the code cleaner, as the delete statement would no longer need the if statement.

Unique fragments bug

https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L729-L732

The issue with fragment naming is explained in oicr-gsi/bamQC#15. The fragment names are defined as "$chrom\t$start1\t$start2" and stored as keys for %startPoints, but due to the bug, two fragments with different start points can have the same fragment name. As the uniqueness of the fragment name is used in
https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L734-L737
this will under-represent the true count.

addRunningBaseCoverage improvements

start1 and start2 documentation is wrong

https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L629-L631

start1 is the POS field from the BAM record and start2 is the PNEXT field
https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/bamqc.pl#L116-L120

This means that:

  • start1 is the left most position of the read in question
  • start2 is the left most position of the mate

This means that it is fairly common for start1 and start2 to be equal if the reads are longer than the fragment. From SWID_13812977_GLCS_0001_Lv_R_PE_458_WG_DKT3-9_190311_A00469_0029_AH75JMDRXX_TGTGTTAA-GGCACGGT_L002_001.annotated.bam:

A00469:29:H75JMDRXX:2:1235:15628:27179	163	chr1	4022750	60	14S109M28S	=	4022750	109
A00469:29:H75JMDRXX:2:1235:15628:27179	83	chr1	4022750	60	42S109M=4022750	-109

This has no impact on the validity of the function, as start2 is used as a component of the hash key. This mistake is made throughout the script, but as far as I can see, start2 is never used as a number for calculations.

Deletion counted towards coverage

https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L663-L666

Why?

procCigar reversing cigar string

The parsed CIGAR string is reversed if it is on the - strand:
https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L1900

I assume the $strand variable is derived from the read reverse strand (16) flag.

I am not sure how this feeds into downstream calculations, but I think that reversing it is unnecessary and could cause issues:

This thread states that reads from the reverse strands are reverse-complemented by aligners to ensure their orientation matches the reference. This means that:

the CIGAR reports the operations performed on the reference FORWARD strand also if the read comes from the reverse strand
the sequence reported in the SEQ field is the one on the FORWARD strand, even if the original read comes from the reverse strand

So by reversing the CIGAR string:

  • The CIGAR string no longer matches the associated sequence. Any use of the reversed CIGAR string with the sequence in the BAM file requires that the sequence be first reverse complemented
  • The left most position field in the BAM file (POS) is invalidated by reversing the CIGAR string. This would cause issues if any comparisons to the reference were being performed.

By never reversing the CIGAR string, the CIGAR string, read sequence, and reference sequence would always remain in sync.

assess_flag improvement

Many of the suggestions below stem from the fact that assess_flag function uses an ifelse hierarchical way of calling the flags, when the bit flag nature makes no such assumptions.

Mapping count

The mapping count issue is documented in JIRA.

paired reads do not have to be mapped

The paired reads flag is only counted if the read is mapped
https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L160-L161
This does not have to be the case. Flag 1 states that the machine was able to sequence Read 1 and Read 2 from a cluster, not if those reads were mapped. In samtools flagstat that flag is called paired in sequencing

mate unmapped reads Count

The mate unmaped reads count is not strictly true (and misspelled).
https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L162-L163

Currently, this is counted if and only if the sister mate is mapped. To me, if both reads were unmapped, the mate should still be counted as unmapped. This is a little ambiguous and could be solved with the below suggestion.

Introduce samtools flagstat with itself and mate mapped

This will give a count if the read and its mate are both mapped. For the counter to go up, the read should have flag 1, but not flag 4 or 8

With these implementations, the following equalities should hold true:
total reads >= paired reads >= mapped reads >= with itself and mate mapped >= properly paired reads

cigar_stats improvements

Unused arguments

Many variables in this function are never used (I might be missing some perl scoping magic):
https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L388-L400

  • $p parameter
  • $pairStart variable and by extension $chrom, $start1, $start2 variables
  • $posOffset variable

Marking deleted bases

https://github.com/oicr-gsi/bamqc/blob/a3964603efc86334e3345d30c331c9aacab256b2/lib/GSI/bamqc.pm#L444-L449
A deleted base exists on the reference, but not on the read, so there are no bases to map. I don't think $mappedBases should be incremented.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.