Code Monkey home page Code Monkey logo

Comments (22)

bradtaylor avatar bradtaylor commented on August 16, 2024

@d-cameron I would like to help with this issue. Can you provide a test case that would display the behavior you are observing?

from picard.

vdauwera avatar vdauwera commented on August 16, 2024

@d-cameron Would you be able to provide a data snippet that reproduces the issue? Otherwise we'll have to close this issue since we can't do anything without a test case. Thanks!

from picard.

d-cameron avatar d-cameron commented on August 16, 2024

Sorry for the long delay. I've put together some minimal test cases (seq and base qualities not relevant and only included as CollectInsertSizeMetrics complained when the seq wasn't there).

In the first test case, the library could be RF or FR. The SAM specs allow TLEN with any sign for either read in the pair - they can even be both positive or negative! My desired behavior is for these ambiguous fragments to be assigned RF or FR based on the rest of the library.
insert_size_metrics_ambiguous_test.sam

@HD VN:1.5  GO:none SO:coordinate
@SQ SN:ref  LN:200
frf1    83  ref 100 255 10M =   100 10  AAAAAAAAAAA AAAAAAAAAAA zz:Z:could_be_rf_or_fr
frf1    163 ref 100 255 10M =   100 -10 AAAAAAAAAAA AAAAAAAAAAA zz:Z:could_be_rf_or_fr
frf2    163 ref 100 255 10M =   100 10  AAAAAAAAAAA AAAAAAAAAAA zz:Z:could_be_rf_or_fr
frf2    83  ref 100 255 10M =   100 -10 AAAAAAAAAAA AAAAAAAAAAA zz:Z:could_be_rf_or_fr
frf3    83  ref 100 255 10M =   100 10  AAAAAAAAAAA AAAAAAAAAAA zz:Z:could_be_rf_or_fr
frf3    163 ref 100 255 10M =   100 -10 AAAAAAAAAAA AAAAAAAAAAA zz:Z:could_be_rf_or_fr

This second case is what I actually encountered when aligning the DREAM Somatic Mutation Challenge data set with bowtie2 but without first performing adapter trimming. This data set is known to be FR. Although simple comparison of the alignment is ambiguous, the soft clipping in the CIGAR shows that the alignments are FR. Soft clipping on the other ends would imply RF.
insert_size_metrics_ambiguous_test.sam

@HD VN:1.5  GO:none SO:coordinate
@SQ SN:ref  LN:200
@HD VN:1.5  GO:none SO:coordinate
@SQ SN:ref  LN:200
frf1    83  ref 100 255 5S5M    =   100 5   AAAAAAAAAAA AAAAAAAAAAA zz:Z:ambiguous_rf_or_fr_but_should_be_fr_if_adapter_not_clipped_before_alignment
frf1    163 ref 100 255 5M5S    =   100 -5  AAAAAAAAAAA AAAAAAAAAAA zz:Z:ambiguous_rf_or_fr_but_should_be_fr_if_adapter_not_clipped_before_alignment
frf2    83  ref 100 255 4S6M    =   100 6   AAAAAAAAAAA AAAAAAAAAAA zz:Z:ambiguous_rf_or_fr_but_should_be_fr_if_adapter_not_clipped_before_alignment
frf2    163 ref 100 255 6M4S    =   100 -6  AAAAAAAAAAA AAAAAAAAAAA zz:Z:ambiguous_rf_or_fr_but_should_be_fr_if_adapter_not_clipped_before_alignment
fr1 163 ref 100 255 10M =   101 11  AAAAAAAAAAA AAAAAAAAAAA zz:Z:fr
fr1 83  ref 101 255 10M =   100 -11 AAAAAAAAAAA AAAAAAAAAAA zz:Z:fr

This third case is the superset of the previous one and also occurs in the DREAM data set. Small fragments that read into adapter sequence get over-aligned when there is an adapter microhomology with the reference next to the alignment position. This causes small FR fragments to appear to be RF. This entire library should be FR.
insert_size_metrics_adapter_microhomology_test.sam

@HD VN:1.5  GO:none SO:coordinate
@SQ SN:ref  LN:200
frf1    83  ref 100 255 5S5M    =   100 5   AAAAAAAAAAA AAAAAAAAAAA zz:Z:ambiguous_rf_or_fr_but_should_be_fr_if_adapter_not_clipped_before_alignment
frf1    163 ref 100 255 5M5S    =   100 -5  AAAAAAAAAAA AAAAAAAAAAA zz:Z:ambiguous_rf_or_fr_but_should_be_fr_if_adapter_not_clipped_before_alignment
frf2    83  ref 100 255 4S6M    =   100 6   AAAAAAAAAAA AAAAAAAAAAA zz:Z:ambiguous_rf_or_fr_but_should_be_fr_if_adapter_not_clipped_before_alignment
frf2    163 ref 100 255 6M4S    =   100 -6  AAAAAAAAAAA AAAAAAAAAAA zz:Z:ambiguous_rf_or_fr_but_should_be_fr_if_adapter_not_clipped_before_alignment
fr1 163 ref 100 255 10M =   101 11  AAAAAAAAAAA AAAAAAAAAAA zz:Z:fr
microhomology   83  ref 100 255 4S6M    =   101 6   AAAAAAAAAAA AAAAAAAAAAA zz:Z:actually_a_fr_pair_with_soft_clip_adapter_with_one_bp_microhomology_with_reference_causing_a_5bp_FR_fragment_to_be_mapped_as_a_6bp_RF
fr1 83  ref 101 255 10M =   100 -11 AAAAAAAAAAA AAAAAAAAAAA zz:Z:fr
microhomology   163 ref 101 255 5M5S    =   100 -6  AAAAAAAAAAA AAAAAAAAAAA zz:Z:actually_a_fr_pair_with_soft_clip_adapter_with_one_bp_microhomology_with_reference_causing_a_5bp_FR_fragment_to_be_mapped_as_a_6bp_RF

from picard.

vdauwera avatar vdauwera commented on August 16, 2024

Thanks @d-cameron, we'll try to look at these soon.

from picard.

vdauwera avatar vdauwera commented on August 16, 2024

@bradtaylor Are you still interested in looking into this?

from picard.

nh13 avatar nh13 commented on August 16, 2024

@vdauwera assume he is not.

from picard.

nenewell avatar nenewell commented on August 16, 2024

I'm looking at this issue, and I have some questions about the direction to take. I'm sorry this is so long.

Some background:

CollectInsertSizeMetrics accepts only the second read of a pair and uses the logic in SamPairUtil.getPairOrientation() to determine the orientation of the pair using the "second in pair" and "reverse strand" elements of the FLAG field, the signed TLEN, and the POS and PNEXT fields. The upshot of the logic is that, as long as one pair is unambiguously right or left of its mate and the mapper gives all these fields the correct values consistent with the SAM standard (in particular, the sign of TLEN is consistent with the "second in pair" flag - always negative for the second segment), then pairs where the second segment is on the forward strand are always classified as RF, while pairs with second segments on the reverse strand are classified as FR. This seems fine.

d-cameron's issue comes from cases where there is left/right ambiguity, where the spec doesn't apply and the sign of TLEN is apparently undefined, as well as cases where the read overlaps the adapter. His first test file contains 3 read pairs, all of which have ambiguous pair orientation because their leftmost map positions are the same. CollectInsertSizeMetrics call the first and third pairs as RF, because the read in each pair that is specified as second in the FLAG is also specified as forward strand and has a negative TLEN. In the second read pair, the read specified as second in the FLAG is marked forward strand, but the pair is called as FR by the program because the sign of TLEN is positive - the positive sign for the second read is apparently legal because the sign standard requires unambiguous left/right assignment.

d-cameron suggests that the pair orientations of ambiguous cases like these be determined based on the orientation of pairs in the rest of the library. So I think he is suggesting that we use the pair orientations of non-ambiguous pairs in the file to determine the orientations of ambiguous pairs. The natural question about this strategy is that since the program is designed to handle SAM files that contain pairs of both orientations simultanesously, how do we determine the pair orientation for a file if both orientations are unambiguously present? Would this be an issue? Or is it almost always the case that a library contains pairs of just one orientation? Since the program's code is designed to process records individually, this change would presumably require substantial restructuring (probably using more than one pass).

One simpler alternative would be to create a fourth category for ambiguous reads and create statistics for this category as well. Another would be to just select FR as the pair orientation for ambiguous reads with shorter TLENs (say <~500), and RF otherwise, since shorter paired-end fragments are apparently usually FR, as opposed to the longer mate pair fragments which are RF. IMO the best solution of all would be to have the mappers determine the orientation unequivocally, and communicate it solely through the FLAGs for each record, dropping the +/- sign of TLEN. Or better still, add an int field for each record that indicates segment position in the template - this would also handle future cases where templates will apparently have more than two segments, and it would preserve the quick visual ID of segment order that we now have with the TLEN sign convention. But I guess we have to work with what we have...

d-cameron's third test file is a superset of the second file and contains 4 read pairs which are known to be FR. The first two pairs are ambiguous. The program calls them as RF because the FLAG indicates that the second reads are on the forward strand, and TLEN is negative for these reads. The program calls the third read, which is not ambiguous, as FR, although I think this is through luck, because the record appears to contain inconsistent information - see my question below. The fourth read presents a non-ambiguous case in which the read overlaps the adapter:

             101 105
            .--F->sssss  
        ssss<--R--  
            100

Here the reverse segment should extend only to 101, but it extends to 100, into the clipping, due to microhomology between the adapter and the reference. This causes an unambiguous assignment of RF due to the shift in the leftmost position. d-cameron would like this to be called as FR. So he is asking us to detect and correct the the effect of the segment's overlap into the adapter. In this case I think removing the overlap would convert the read into an ambiguous case which would be handled using whatever method is decided for those cases. But how would we correct the overlaps? Trimming programs like the recent Skewer (which specializes in paired-end reads) search for the adapter sequences in the FastQ files and remove them. But in these cases we expect to have only a small fraction of the adapter present in the read. Is there another way to do this? If one member of the pair is longer than the other, just trim it to the shorter length? But isn't this adapter/overlap issue likely to cause problems for other programs as well? Shouldn't the adapters really be trimmed before mapping?

I have two questions about the files themselves. The first question involves the first two read pairs in the third test file (frf1 and frf2). The pairs are ambiguous, but d-cameron states that the soft clipping reveals the underlying FR orientation. How does the soft clipping do this? Judging from the FLAG values and the CIGARS, the first pair should map like:

          100 104
          --F->sssss  (FLAG 163)
     sssss<--R-       (FLAG 83)

Why does this mean that the read is FR?

The second question involves the third read pair in the third test file (fr1), which apparently describes a non-ambiguous case. From the positions, the CIGARs, and the forward/reverse entries in the FLAGs, the pair looks like this:

 100       109
 -----F---->.     (FLAG 163)
 .<-----R----     (FLAG 83)
  100       110

d-cameron labels this as FR and the program calls it as FR, so there appears to be no issue. But the FLAGs indicate the forward read is the second in the pair and the reverse read is the first, which I don't understand, because the forward read is left of the reverse one. So the SAM file seems to be inconsistent for this pair. How can the mapper output inconsistent information for an apparently unambiguous pair?

Thanks in advance for any input on these questions.

from picard.

d-cameron avatar d-cameron commented on August 16, 2024

Thanks for looking into this. My particular use case comes from using CollectInsertSizeMetrics ina variant calling pipeline to automatically determine the library orientation which, for known FR libraries with small fragment sizes, gives an answer that does not match the actual library composition.

fr1 in the third test case is FR. The first read in the pair happens come from the Crick (negative) strand, but the strand and orientation are what is expected for a FR library (align to different strands, both reads point towards to the other read, sane fragment size). Some RNA protocols results in libraries in which the first read is known to come from the sense or missense strand but there'd only be value in splitting out the two possible FR strand alignment when aligning to a transcript database since for alignments to reference genomes, you don't know which strand the RNA came from anyway.

Why does this mean that the read is FR?

When raising I was under the impression that short fragments in RF mate pair libraries soft clip in a different manner but you are correct, disambiguating through soft clip orientation would only work for FF.

Another would be to just select FR as the pair orientation for ambiguous reads with shorter TLENs (say <~500), and RF otherwise, since shorter paired-end fragments are apparently usually FR, as opposed to the longer mate pair fragments which are RF

At least for Illunima Nextera mate pair, overlapping RF fragments are not even possible at all (Figure 2). All overlapping fragment would either be FR fragments, or discordant alignments. Does there currently exist any library preps/sequencing technologies for which RF overlap is a valid alignment?

In this case I think removing the overlap would convert the read into an ambiguous case which would be handled using whatever method is decided for those cases. But how would we correct the overlaps?

I've handled these in my downstream analysis by looking for adapter homology between the soft clip position, and the alignment position of the other read. Unfortunately this requires knowing the adapter sequence. Other options are to ignore them (since RF overlap is invalid), class as ambiguous, or to give FR a few bp leeway.

how do we determine the pair orientation for a file if both orientations are unambiguously present?

Currently CollectInsertSizeMetrics split up to the output and reports per-orientation metrics. A file with both FR and RF reads already outputs metrics for both the FR and RF subsets.

from picard.

nenewell avatar nenewell commented on August 16, 2024

d-cameron thanks very much for your informative reply. I'm new to this stuff and I need to think more about it, but I think your answer to the fr1 question is already improving my understanding - I think I was operating with the understanding that the forward read was always on the Watson strand, and this seems to be incorrect...

On the question about determining pair orientation for the file, I do know that the program outputs per-orientation metrics, but I was trying to understand how we would decide on an orientation for the library to assign to the ambiguous reads if the file contains both orientations. Hmmm... this was based on the assumption that the set of all reads in a file constitutes the library at issue, which is probably also incorrect...maybe reads from multiple libraries exist in a file, indicated by a field like read group...so we could look for an unambiguous read with the same entry for that field...you can see how new I am - thanks again.

from picard.

d-cameron avatar d-cameron commented on August 16, 2024

CollectInsertSizeMetrics aggregates at multiple levels and it's up to the user to specify how they wish to aggregate. Disambiguation should be agnostic to what the source of the reads it is being asked to aggregate is: CollectInsertSizeMetrics should do the best job it can with the data it has been given. If the user has specified file level metrics, CollectInsertSizeMetrics shouldn't look at read group information, it should aggregate at file level as requested. In the scenario where two libraries have not been barcoded and are physically mixed together before sequencing (yes, there are data sets out there like this) then there won't even been a RG tag to separate them.

For this particular issue, I think the key insight is that, at least for Illumina mate pair libraries, the library preparation process means RF reads cannot overlap (short fragments come out FR). Philosophically, I don't know the approach Picard tools wishes to take: design tools for existing data (hard-code no RF overlap), or take a more theoretically pure approach that would handle arbitrary read alignments (ie overlapping RF from either new chemistry or sequencing technology).

from picard.

nenewell avatar nenewell commented on August 16, 2024

I'd like to propose a fix for this issue. I've coded and tested it, but before I submit it as pull requests to picard and htsjdk, I'd like to solicit comments as to whether the strategy is acceptable. In the current code, InsertSizeMetricsCollector.acceptRecord() selects the second read from each pair, and SamPairUtils.getPairOrientation() determines the pair orientation using the read positions, forward/reverse strand FLAGs, and the signed TLEN. The ambiguous pair issue arises because the sign of TLEN is not well defined by the spec when the pair is completely overlapping, with the leftmost positions the same for each segment, so in d-cameron's examples completely overlapping reads from an FR library can be effectively called as RF by the aligner, via the specification of the FLAGs and the sign of TLEN, resulting in CollectInsertSizeMetrics specifying the pair as RF.

I think the best way around this issue is to stop relying on the signed TLEN for pair orientation calls. This can be done by accepting the reverse read of each pair for processing instead of the second read of the pair. If we process only the reverse reads, we don't need to use the signed TLEN to determine the relative positions of the 5' ends of the reads, because the 5' position of the reverse read is immediately available as getAlignmentEnd(), and the 5' position of the mate (the forward read) is immediately available as getMateAlignmentStart(). This is in contrast to the situation that occurs when we process the forward read of a pair, where the 5' position of the mate (reverse) read is not available because we don't know the length of the the mate, and the code uses TLEN.

A slight complication with this strategy is that we can't accept only reverse members of tandem pairs, because we would double count reverse tandem pairs and miss counting forward tandem pairs. But a slight modification of the acceptance conditions in InsertSizeMetricsCollector.acceptRecord() handles this, by accepting only the second reads in tandem cases, where TLEN isn't used anyway because only one orientation is possible, while accepting only the reverse reads in non-tandem cases. Although this does add an extra line of code to acceptRecord(), the code for RF/FR calling in GetPairOrientation is simplified down to a single conditional statement.

With this change, the relative positions of the 5' ends of the reads, and therefore the pair orientation, are determined independently of TLEN. All overlapping pairs are called FR (except when they only overlap by their single 5'-most bases), because the 5' ends of the forward reads are always left of the 5' ends of the reverse reads, so the ambiguity issue is resolved. The choice of FR for completely overlapping pairs is consistent with our existing test case in SamPairUtilTest.testGetPairOrientationDataProvider, which calls for FR for these pairs (this test has been passing while d-cameron's tests fail because the test explicitly specifies first read forward/second read reverse while the failing d-cameron tests specify the reverse). Pairs that overlap by only the single bases at their 5' ends are called as RF with the new strategy in order to be consistent with the standard implied in the existing GetPairOrientation(), but in light of d-cameron's point that RF reads can't overlap, perhaps we should also call these pairs as FR. Input on this point would be appreciated.

This strategy has the important added benefit of also handling the adapter overrun issue in many cases. Since pair orientation is now determined always by the correct relative positions of the 5' ends, adapter overruns are ignored in all cases where the overruns are at the 3' ends. In addition, overruns are ignored in all FR cases, since in these cases 5' overruns simply shift the 5' ends further apart while 3' overruns are ignored. So the only case that remains susceptible to overruns is the 5' overrun case for RF. We do have a test case in SamPairUtilTest.testGetPairOrientationDataProvider for "nojump outie" that would be susceptible, because the 5' ends are adjacent, but how often should we expect the 5' ends of mate pairs to be within a few bases, given that jumping libraries are supposed to be derived from initial fragments with lengths of thousands of bases?

I've written a test routine for these changes that handles the 15 pair orientation cases in testGetPairOrientationDataProvider, and I've also tested the complete CollectInsertSizeMetrics program using its existing test, a new SAM file with the 15 pair orientation cases, and d-cameron's test files, and all tests pass.

This change is written using a new, separate version of GetPairOrientation(), rather than a modification of the existing version, since GetPairOrientation() is also used in SamPairUtil.isProperPair() and CollectJumpingLibraryMetrics.

Thanks for any input.

from picard.

vdauwera avatar vdauwera commented on August 16, 2024

@yfarjoun This seems like it would be up your street to comment & review, yes?

from picard.

yfarjoun avatar yfarjoun commented on August 16, 2024

sure...I'll take a look.

On Tue, Jun 9, 2015 at 4:19 PM, Geraldine Van der Auwera <
[email protected]> wrote:

@yfarjoun https://github.com/yfarjoun This seems like it would be up
your street to comment & review, yes?


Reply to this email directly or view it on GitHub
#103 (comment)
.

from picard.

yfarjoun avatar yfarjoun commented on August 16, 2024

I'm a little confused by this whole discussion. My understanding of FR and RF is that the orientation is based on the 5' ends of the two reads, and specifically on whether the 5' end of the positive read is before (FR) or not (RF) that of the negative read. Lines 88 through 90 in SamPaiUtil:

       return ( positiveStrandFivePrimePos < negativeStrandFivePrimePos
                ? PairOrientation.FR
                : PairOrientation.RF );

Pictorially (the 5' end marked with a .):

<-----. regardless of read1 or read2 this is the negative strand
     .----->  thus this is an FR pair

<-----. 
.----->  and this

 <-----. 
.----->  and this




 <-----. 
       .----->  this is an RF pair

<-----. 
           .----->  as is this

Thus I haven't yet seen an RF pair in the discussion. In particular I do not understand how the reads presented by @d-cameron are ambiguous:

frf1    83  ref 100 255 10M =   100 10  AAAAAAAAAAA AAAAAAAAAAA zz:Z:could_be_rf_or_fr
frf1    163 ref 100 255 10M =   100 -10 AAAAAAAAAAA AAAAAAAAAAA zz:Z:could_be_rf_or_fr

The 5' ends here are 100 (for the positive strand) and 110 (for the negative strand), thus clearly a FR.
similarly all the other pairs in the discussion.

What am I missing?

Also @nenewell: You seem to be confused that the first read isn't mapped before the second read ("But the FLAGs indicate the forward read is the second in the pair and the reverse read is the first, which I don't understand, because the forward read is left of the reverse one."). Surely this is just a small misunderstanding: the first read is the read that came out of the sequencer first, not the one that maps to an earlier position in the genome.

from picard.

nenewell avatar nenewell commented on August 16, 2024

Thanks for looking at this yfarjoun. I’m going to have to re-familiarize myself with this before can respond. I’m getting ready for a two-week trip, so I don’t know if I’ll have time to respond before then, but I’ll try. You are correct that my confusion on the first/second issue was due to a misunderstanding of the meaning of first and second reads, which was cleared up by the discussion with d-cameron.

From: Yossi Farjoun
Sent: Friday, July 17, 2015 11:40 AM
To: broadinstitute/picard
Cc: nenewell
Subject: Re: [picard] CollectInsertSizeMetrics infers incorrect PairOrientation (#103)

I'm a little confused by this whole discussion. My understanding of FR and RF is that the orientation is based on the 5' ends of the two reads, and specifically on whether the 5' end of the positive read is before (FR) or not (RF) that of the negative read. Lines 88 through 90 in SamPaiUtil:

   return ( positiveStrandFivePrimePos < negativeStrandFivePrimePos
            ? PairOrientation.FR
            : PairOrientation.RF );Pictorially (the 5' end marked with a .):

<-----. regardless of read1 or read2 this is the negative strand
.-----> thus this is an FR pair

<-----.
.-----> and this

<-----.
.-----> and this

<-----.
.-----> this is an RF pair

<-----.
.-----> as is this
Thus I haven't yet seen an RF pair in the discussion. In particular I do not understand how the reads presented by @d-cameron are ambiguous:

frf1 83 ref 100 255 10M = 100 10 AAAAAAAAAAA AAAAAAAAAAA zz:Z:could_be_rf_or_fr
frf1 163 ref 100 255 10M = 100 -10 AAAAAAAAAAA AAAAAAAAAAA zz:Z:could_be_rf_or_fr
The 5' ends here are 100 (for the positive strand) and 110 (for the negative strand), thus clearly a FR.
similarly all the other pairs in the discussion.

What am I missing?

Also @nenewell: You seem to be confused that the first read isn't mapped before the second read ("But the FLAGs indicate the forward read is the second in the pair and the reverse read is the first, which I don't understand, because the forward read is left of the reverse one."). Surely this is just a small misunderstanding: the first read is the read that came out of the sequencer first, not the one that maps to an earlier position in the genome.


Reply to this email directly or view it on GitHub.

from picard.

nenewell avatar nenewell commented on August 16, 2024

After a quick reboot on this, this is the situation as I recall it: these reads are indeed unambiguously FR by the FR/RF definition based on the relative positions of the 5' ends. The problem is that the pair orientation caller calls many of these reads as RF. This happens because the caller routine relies on the FLAG and TLEN, and the sign of TLEN is ambiguous in the SAM spec for reads that completely overlap each other, because it is determined based on which of the reads is "leftmost" and which is "rightmost", instead of the relative 5' positions.

The changes I'm suggesting remove the dependence of pair orientation calling on TLEN in CollectInsertSizeMetrics, because if the reverse read of every pair is chosen during pair evaluation instead of the second read, then TLEN is not needed to determine orientation, since the 5' position of the reverse read is immediately available as getAlignmentEnd(), and the 5' position of the mate (the forward read) is immediately available as getMateAlignmentStart(). So the code in GetPairOrientation(), which currently does a computation relying on TLEN, is simplified and the 5' end-based definition of FR/RF is strictly enforced regardless of TLEN.

from picard.

yfarjoun avatar yfarjoun commented on August 16, 2024

Ah, I see now. the TLEN is needed to the negativeStrand5'pos when the read is on the positive strand. This is because the mate's 5' end position is assumed to be alignmentStart+TLEN. however, in some cases it could be alignmentStart-TLEN...

OK. Now that I'm on the same page...

I think your suggestion is fine with the following modification:

I would advocate using a single GetPairOrientation() (i.e. not changing it) but adding a comment to it that it can lead to unexpected results if given the forward strand in the case that.....you can link to this issue in the code. I'm OK with changing the < to <= which would effectively change the RF into FR in the ambiguous case, right?

from picard.

nenewell avatar nenewell commented on August 16, 2024

from picard.

yfarjoun avatar yfarjoun commented on August 16, 2024

My modification is only regarding GetPairOrientation(). You will still need
to change CollectInsertSizeMetrics to calculate over the reverse reads
only, as you suggested.

However, I'm going to ask that you put this issue on hold. I'd like to see
if we could change the definition of the sign of TLEN to be positive or
negative according to the 5' ends of the reads rather than the alignment
start of the reads...

Stay tuned!

On Fri, Jul 17, 2015 at 4:33 PM, nenewell [email protected] wrote:

To be sure I understand – you would like me to not change the code (except
for the inequality), but to incorporate a comment which describes and links
to the issue in the existing GetPairOrientation()? Of course, you have the
broad view of this stuff – but I am curious – are you concerned about
compatibility issues in the code, or consistency with results generated
previously or elsewhere?
Changing < to <= will resolve the ambiguous case of equivalent 5’
positions (but of course not the ambiguous case of mutually overlapping
reads) in favor of FR when the 5’ positions are correctly calculated...for
the existing code I’m not sure that it will always work this way – I could
check that.
From: Yossi Farjoun
Sent: Friday, July 17, 2015 3:22 PM
To: broadinstitute/picard
Cc: nenewell
Subject: Re: [picard] CollectInsertSizeMetrics infers incorrect
PairOrientation (#103)

Ah, I see now. the TLEN is needed to the negativeStrand5'pos when the read
is on the positive strand. This is because the mate's 5' end position is
assumed to be alignmentStart+TLEN. however, in some cases it could be
alignmentStart-TLEN...

OK. Now that I'm on the same page...

I think your suggestion is fine with the following modification:

I would advocate using a single GetPairOrientation() (i.e. not changing
it) but adding a comment to it that it can lead to unexpected results if
given the forward strand in the case that.....you can link to this issue in
the code. I'm OK with changing the < to <= which would effectively change
the RF into FR in the ambiguous case, right?


Reply to this email directly or view it on GitHub.


Reply to this email directly or view it on GitHub
#103 (comment)
.

from picard.

yfarjoun avatar yfarjoun commented on August 16, 2024

The problem seems to be bigger than originally reported for two reasons:

  1. According to @lh3, BWA reports TLEN to be the 5'-5' distance despite the spec (see http://sourceforge.net/p/samtools/mailman/message/32148085/ for latest discussion) and so we should be looking for the 5' end using UnclippedAlignmentStart()+TLEN rather than AlignementStart()+TLEN
    Similarly, the code should be looking for the UnclippedAlignmentEnd since that is needed to determine orientation, not the alignment end.
  2. It seems that the definition of TLEN is the problem to begin with since (for example) POS=20 and TLEN=-11 could mean to very different pairs (even if we know that the strands are +/-):
    20
     |------>     Since both reads align to position 20 the sign of TLEN
     <------|     isn't currently well defined in the spec and could be -11, 
           30

                      20
                       |-------->
   <--------|
           10

Given that we currently use alignment and not unclipped alignment to determine orientation, your solution, @nenewell, should give consistent results with what picard does now. However, it seems that in order to really get the correct answer we need to parse the cigar string and extract the unclipped ends directly. At that point we can ignore the TLEN altogether. This can be avoided if TLEN spec is modified to encode the distance between the 5' ends and its sign records which 5' end comes first. (At that point the sign of TLEN will be very close to the answer we are looking for...)

from picard.

nenewell avatar nenewell commented on August 16, 2024

Thanks for clarifying what you meant with your modification yfarjoun. I was a bit rushed and I misunderstood. That's very interesting that TLEN reports the 5'-5' distance, rather than the template length as the SAM spec says. This explains for me exactly why the current getPairOrientation() code works when it processes forward reads in RF pairs: "NegativeStrand5'Position = getAlignmentStart() + TLEN" doesn't make sense for RF if TLEN means the whole fragment length.

I hadn't thought about the implications of clipping for pair orientation calling...if we want to account for clipping, I don't think either the existing code or my version will work. The problem is that both rely on getMateAlignmentStart() to compute the PositiveStrand5'Position for reads on the reverse strand. And we don't have access to the mate's cigar, because we don't have any kind of pointer to the mate read - we just have its start position. (I think it's very unfortunate in the SAM spec that since the "read name" is actually the template name, paired reads don't have unique identifiers, and one member of a pair knows little about the other.) Is this right?

Your solution of having the sign of TLEN encode which 5' end comes first gets around this issue of course. If the aligner does this, it's essentially calling the pair for us. Is there any reason why it shouldn't do this? Is there uncertainty related to the mapping that might make it inappropriate for the aligner to definitively call the pairs? Actually, if the aligner can do this, it should just add a FLAG (or a new field) that specifies RF, FR, or tandem.

from picard.

vdauwera avatar vdauwera commented on August 16, 2024

Closing given old age of thread and lack of concrete proposal.

from picard.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.