MarkDuplicates question about picard HOT 9 CLOSED

broadinstitute commented on August 16, 2024

MarkDuplicates question

from picard.

Comments (9)

splaisan commented on August 16, 2024

Dear,

If I may add a small comment,
I have been searching for tools to discriminate between optical and PCR duplicates and process them separately, but have not found much so far. At best I could get counts of each type, but I have not yet succeeded in removing one or the other specifically. This (if I am not totally wrong) would be a very nice addition too.

Stephane Plaisance
[email protected]

On 12 Mar 2015, at 14:27, Geraldine Van der Auwera [email protected] wrote:

User is confused about exactly what criteria are used by MarkDuplicates. GATK docs say it marks as dupes reads that have the same start pos and identical CIGAR, but it looks like Picard docs say nothing about CIGAR, and just mentions length. We need to reconcile this (I expect the GATK doc needs to be corrected).

Secondary question is how it chooses which read is kept as non-dupe -- random or highest MAPQ?

This Issue was generated from your forums

—
Reply to this email directly or view it on GitHub.

from picard.

nh13 commented on August 16, 2024

@vdauwera we should write this up, as it is not what is in the GATK doc and not what you describe above. This would be a great thing to document correctly.

@splaisan you can look at the admittedly complex code in the markduplicates subdirectories: https://github.com/broadinstitute/picard/tree/master/src/java/picard/sam/markduplicates
No such user tool exists.

from picard.

splaisan commented on August 16, 2024

Dear Nils,

Thanks for your answer, looking into the code is not really an option for me (biologist with only basic perl/bash knowledge) :-)
I thought that if diagnostic counts can be returned for optical duplicates (based on physical position on the chip) and PCR duplicates (generated while creating the library and identified by the start position on the genome + ...), then some software should also be able to do something with these identified reads!
Probably I was a bit too optimistic.
Thanks anyway for the effort and your answer, I will keep searching for the ideal tool.
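
For anyone wondering what "based on physical position on the chip" could look like in practice, here is a rough Python sketch. It is not Picard code. It assumes Illumina-style read names of the form `instrument:run:flowcell:lane:tile:x:y`, and it labels a duplicate as optical when it sits on the same tile as the kept read and within a pixel distance in both x and y. The default of 100 is meant to mirror Picard's `OPTICAL_DUPLICATE_PIXEL_DISTANCE` option as I understand it; the function names and the comparison against only the kept read are simplifications made up for this example.

```python
from typing import List, NamedTuple, Optional, Tuple


class Cluster(NamedTuple):
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int


def parse_cluster(read_name: str) -> Optional[Cluster]:
    """Extract flowcell/lane/tile/x/y from a read name.

    Assumption: names look like instrument:run:flowcell:lane:tile:x:y.
    Returns None when the name does not fit that pattern.
    """
    fields = read_name.split(":")
    if len(fields) < 7:
        return None
    try:
        lane, tile, x, y = (int(f) for f in fields[3:7])
    except ValueError:
        return None
    return Cluster(fields[2], lane, tile, x, y)


def is_optical(name_a: str, name_b: str, max_pixel_distance: int = 100) -> bool:
    """True if both reads come from the same tile and are close in x and y."""
    a, b = parse_cluster(name_a), parse_cluster(name_b)
    if a is None or b is None:
        return False
    same_tile = (a.flowcell, a.lane, a.tile) == (b.flowcell, b.lane, b.tile)
    return (same_tile
            and abs(a.x - b.x) <= max_pixel_distance
            and abs(a.y - b.y) <= max_pixel_distance)


def split_duplicates(kept: str, duplicates: List[str],
                     max_pixel_distance: int = 100) -> Tuple[List[str], List[str]]:
    """Split the duplicates of one set into (optical, other) relative to the kept read."""
    optical = [d for d in duplicates if is_optical(kept, d, max_pixel_distance)]
    other = [d for d in duplicates if d not in optical]
    return optical, other


kept = "M1:7:FC1:1:1101:1000:2000"
dups = [
    "M1:7:FC1:1:1101:1050:2040",  # same tile, 50/40 pixels away
    "M1:7:FC1:1:1102:1000:2000",  # different tile
    "M1:7:FC1:1:1101:5000:2000",  # same tile, far away in x
]
optical, other = split_duplicates(kept, dups)
```

With lists like `optical` and `other` in hand, the read names could in principle be used to filter a BAM with whatever tool you prefer, which is the part that was missing from the tools I looked at.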

Stephane Plaisance
[email protected]

from picard.

tfenne commented on August 16, 2024

FYI most of this is covered in the picard FAQ:

http://broadinstitute.github.io/picard/faq.html

-t

from picard.

vdauwera commented on August 16, 2024

Indeed, thanks Tim.

From the FAQ:

Q: How does MarkDuplicates work?

A: Essentially what it does (for pairs; single-end data is also handled) is
to find the 5' coordinates and mapping orientations of each read pair. When
doing this it takes into account all clipping that has taken place as well
as any gaps or jumps in the alignment. You can thus think of it as
determining "if all the bases from the read were aligned, where would the
5' most base have been aligned". It then matches all read pairs that have
identical 5' coordinates and orientations and marks as duplicates all but
the "best" pair. "Best" is defined as the read pair having the highest sum
of base qualities at bases with Q >= 15.

If your reads have been divided into separate BAMs by chromosome,
inter-chromosomal pairs will not be identified, but MarkDuplicates will not
fail due to inability to find the mate pair for a read.
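
To make the FAQ description concrete, here is a minimal Python sketch of the idea. It is not Picard's actual implementation. It computes the unclipped 5' coordinate of each read from POS and CIGAR (treating M, D, N, = and X as reference-consuming, so a spliced alignment with N is spanned like any other gap), groups pairs by the coordinates and strands of both ends, and keeps the pair with the highest sum of base qualities at or above Q15. The `Read` and `Pair` classes and the function names are made up for illustration, and things the real tool also considers, such as library and single-end reads, are left out.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Set

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operators that advance along the reference
CLIPS = set("SH")             # soft and hard clips


@dataclass
class Read:
    chrom: str
    pos: int          # 1-based leftmost aligned base (SAM POS)
    cigar: str
    reverse: bool
    quals: List[int]  # Phred base qualities


@dataclass
class Pair:
    name: str
    r1: Read
    r2: Read


def unclipped_five_prime(pos: int, cigar: str, reverse: bool) -> int:
    """Where the 5'-most base would sit if every base had been aligned."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not reverse:
        # Forward strand: step back over any leading clips.
        lead = 0
        for n, op in ops:
            if op not in CLIPS:
                break
            lead += n
        return pos - lead
    # Reverse strand: the 5' end is the rightmost base, so add the
    # reference span of the alignment plus any trailing clips.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trail = 0
    for n, op in reversed(ops):
        if op not in CLIPS:
            break
        trail += n
    return pos + ref_len - 1 + trail


def pair_score(pair: Pair, min_q: int = 15) -> int:
    """Sum of base qualities over bases with Q >= min_q, both reads."""
    return sum(q for q in pair.r1.quals + pair.r2.quals if q >= min_q)


def mark_duplicates(pairs: List[Pair]) -> Set[str]:
    """Return the names of pairs that would be flagged as duplicates."""
    groups = defaultdict(list)
    for p in pairs:
        ends = sorted(
            (r.chrom, unclipped_five_prime(r.pos, r.cigar, r.reverse), r.reverse)
            for r in (p.r1, p.r2)
        )
        groups[tuple(ends)].append(p)
    dupes: Set[str] = set()
    for members in groups.values():
        best = max(members, key=pair_score)
        dupes.update(m.name for m in members if m is not best)
    return dupes


pairs = [
    Pair("A", Read("chr1", 100, "5S95M", False, [30] * 100),
              Read("chr1", 300, "100M", True, [30] * 100)),
    Pair("B", Read("chr1", 95, "100M", False, [10] * 100),
              Read("chr1", 300, "90M10S", True, [10] * 100)),
    Pair("C", Read("chr1", 96, "100M", False, [30] * 100),
              Read("chr1", 300, "100M", True, [30] * 100)),
]
duplicates = mark_duplicates(pairs)
```

In this toy input, pairs A and B differ in POS and CIGAR but share the same unclipped 5' coordinates and strands, so they fall into one group and the lower-scoring B is flagged. Pair C starts one base later and stays on its own. Note that the base sequence never enters the grouping key, only the alignment fields do.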

Geraldine A. Van der Auwera, Ph.D.
Bioinformatics Scientist II
GATK Support & Outreach
Broad Institute

from picard.

vdauwera commented on August 16, 2024

Will refer GATK users to the Picard FAQ and correct any GATK docs with erroneous info.

from picard.

amilev commented on August 16, 2024

Do you know if it also works with the RNA N's?

from picard.

vdauwera commented on August 16, 2024

Hmm. I would expect it to work, but no guarantees.

from picard.

yfarjoun commented on August 16, 2024

If the data is paired-end, it doesn't look at the bases, only at the alignment information.

from picard.
