
Comments (9)

splaisan avatar splaisan commented on August 16, 2024

Dear,

If I may add a small comment,
I have been searching for tools to discriminate and separately process optical and PCR duplicates and have not found much so far. At best I could get counts of each, but I have not yet succeeded in removing one type or the other specifically. This (if I am not totally wrong) would be a very nice addition too.

Stephane Plaisance

On 12 Mar 2015, at 14:27, Geraldine Van der Auwera wrote:

User is confused about exactly what criteria are used by MarkDuplicates. GATK docs say it marks as dupes reads that have the same start pos and identical CIGAR, but it looks like Picard docs say nothing about CIGAR, and just mentions length. We need to reconcile this (I expect the GATK doc needs to be corrected).

Secondary question is how it chooses which read is kept as non-dupe -- random or highest MAPQ?

This Issue was generated from your forums

from picard.

nh13 avatar nh13 commented on August 16, 2024

@vdauwera we should write this up, as it is not what is in the GATK doc and not what you describe above. This would be a great thing to document correctly.

@splaisan you can look at the admittedly complex code in the markduplicates subdirectories: https://github.com/broadinstitute/picard/tree/master/src/java/picard/sam/markduplicates
No such user tool exists.

splaisan avatar splaisan commented on August 16, 2024

Dear Nils,

Thanks for your answer. Looking into the code is not really an option for me (biologist with only basic perl/bash knowledge) :-)
I thought that if diagnostic counts can be returned for optical duplicates (based on physical position on the chip) and PCR duplicates (generated while creating the library and identified by the start position on the genome + ...), then some software should also be able to do something with these identified reads!
Probably I was a bit too optimistic.
Thanks anyway for the effort and your answer; I will keep searching for the ideal tool.
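
For illustration only, the optical/PCR distinction described above could be sketched roughly as follows. This is not Picard's code: it assumes Illumina-style read names ending in `lane:tile:x:y`, reads from a single flowcell, and an illustrative pixel threshold of 100. The function names `parse_location`, `is_optical_pair` and `split_duplicates` are hypothetical.

```python
def parse_location(read_name):
    """Extract (lane, tile, x, y) from a read name.

    Assumption: the name ends in ...:lane:tile:x:y, with no '/1' or
    '#barcode' suffix after the y coordinate.
    """
    lane, tile, x, y = read_name.split(":")[-4:]
    return lane, tile, int(x), int(y)


def is_optical_pair(name_a, name_b, max_dist=100):
    """True if both reads sit on the same lane and tile within max_dist pixels."""
    lane_a, tile_a, xa, ya = parse_location(name_a)
    lane_b, tile_b, xb, yb = parse_location(name_b)
    same_tile = (lane_a, tile_a) == (lane_b, tile_b)
    return same_tile and abs(xa - xb) <= max_dist and abs(ya - yb) <= max_dist


def split_duplicates(kept_name, duplicate_names, max_dist=100):
    """Partition already-identified duplicates into (optical, pcr) lists,
    judged by their distance from the read that was kept."""
    optical, pcr = [], []
    for name in duplicate_names:
        target = optical if is_optical_pair(kept_name, name, max_dist) else pcr
        target.append(name)
    return optical, pcr
```

Under these assumptions, duplicates that land close to the kept read on the same tile are treated as optical and everything else as PCR duplicates, so each group could then be filtered separately.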

Stephane Plaisance

tfenne avatar tfenne commented on August 16, 2024

FYI most of this is covered in the picard FAQ:

http://broadinstitute.github.io/picard/faq.html

-t

vdauwera avatar vdauwera commented on August 16, 2024

Indeed, thanks Tim.

From the FAQ:

Q: How does MarkDuplicates work?

A: Essentially what it does (for pairs; single-end data is also handled) is
to find the 5' coordinates and mapping orientations of each read pair. When
doing this it takes into account all clipping that has taken place as well
as any gaps or jumps in the alignment. You can thus think of it as
determining "if all the bases from the read were aligned, where would the
5' most base have been aligned". It then matches all read pairs that have
identical 5' coordinates and orientations and marks as duplicates all but
the "best" pair. "Best" is defined as the read pair having the highest sum
of base qualities at bases with Q >= 15.

If your reads have been divided into separate BAMs by chromosome,
inter-chromosomal pairs will not be identified, but MarkDuplicates will not
fail due to inability to find the mate pair for a read.
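
As a rough sketch of the logic the FAQ answer describes (not Picard's actual implementation), the unclipped 5' coordinate and the "best pair" choice might look like this in Python. The record layout (dicts with `chrom`, `pos`, `cigar`, `reverse`, `quals`) and the function names are assumptions made for illustration, and tie-breaking is simply "first seen".

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference
CLIPS = set("SH")             # soft and hard clips


def unclipped_five_prime(pos, cigar, reverse):
    """Where the 5'-most base would align if no bases had been clipped.

    pos is the 1-based leftmost aligned position (SAM POS).
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not reverse:
        # Forward strand: 5' end is on the left, so undo leading clips.
        lead = 0
        for n, op in ops:
            if op not in CLIPS:
                break
            lead += n
        return pos - lead
    # Reverse strand: 5' end is on the right, so walk the reference span
    # (including gaps such as D and N) and add trailing clips.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trail = 0
    for n, op in reversed(ops):
        if op not in CLIPS:
            break
        trail += n
    return pos + ref_len - 1 + trail


def score(quals, min_q=15):
    """Sum of base qualities, counting only bases with Q >= min_q."""
    return sum(q for q in quals if q >= min_q)


def mark_duplicates(pairs):
    """Return the names of pairs that would be marked as duplicates."""
    groups = defaultdict(list)
    for pair in pairs:
        key = tuple(sorted(
            (r["chrom"],
             unclipped_five_prime(r["pos"], r["cigar"], r["reverse"]),
             r["reverse"])
            for r in pair["reads"]
        ))
        groups[key].append(pair)
    dupes = set()
    for group in groups.values():
        best = max(group, key=lambda p: sum(score(r["quals"]) for r in p["reads"]))
        dupes.update(p["name"] for p in group if p is not best)
    return dupes
```

Note that only alignment information enters the grouping key here; base qualities are used solely to pick which pair in a group is kept.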

Geraldine A. Van der Auwera, Ph.D.
Bioinformatics Scientist II
GATK Support & Outreach
Broad Institute

vdauwera avatar vdauwera commented on August 16, 2024

Will refer GATK users to the Picard FAQ and correct any GATK docs with erroneous info.

amilev avatar amilev commented on August 16, 2024

Do you know if it also works with the RNA N's?

vdauwera avatar vdauwera commented on August 16, 2024

Hmm. I would expect it to work, but no guarantees.

yfarjoun avatar yfarjoun commented on August 16, 2024

If the data is paired-end, it doesn't look at the bases, only at the alignment
information.
