Comments (9)
Dear,
If I may add a small comment,
I have been searching for tools to distinguish optical and PCR duplicates and process them separately, but have not found much so far. At best I could get counts of each, but I have not yet succeeded in removing one type or the other specifically. This (if I am not totally wrong) would be a very nice addition too.
Stephane Plaisance
On 12 Mar 2015, at 14:27, Geraldine Van der Auwera wrote:
User is confused about exactly what criteria are used by MarkDuplicates. GATK docs say it marks as dupes reads that have the same start pos and identical CIGAR, but it looks like the Picard docs say nothing about CIGAR and just mention length. We need to reconcile this (I expect the GATK doc needs to be corrected).
Secondary question is how it chooses which read is kept as non-dupe -- random or highest MAPQ?
This Issue was generated from your forums
—
Reply to this email directly or view it on GitHub.
from picard.
@vdauwera we should write this up, as it is not what is in the GATK doc and not what you describe above. This would be a great thing to document correctly.
@splaisan you can look at the admittedly complex code in the markduplicates subdirectories: https://github.com/broadinstitute/picard/tree/master/src/java/picard/sam/markduplicates
No such user tool exists.
from picard.
Dear Nils,
Thanks for your answer. Looking into the code is not really an option for me (biologist with only basic perl/bash knowledge) :-)
I thought that if diagnostic counts can be returned for optical duplicates (based on physical position on the chip) and PCR duplicates (generated while creating the library and identified by the start position on the genome + ...), then some software should also be able to do something with these identified reads!
Probably I was a bit too optimistic.
Thanks anyway for the effort and your answer, I will keep searching for the ideal tool.
Stephane Plaisance
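The distinction drawn in this comment (optical duplicates sit physically close together on the flowcell, PCR duplicates do not) can be illustrated with a minimal Python sketch. It is not Picard's implementation. It assumes read names end in `tile:x:y` (as in the common Illumina layout `instrument:run:flowcell:lane:tile:x:y`), compares each duplicate only against the one read that was kept, and uses a hypothetical threshold of 100 pixels. The names `parse_position`, `is_optical_duplicate`, and `split_duplicates` are made up for this illustration.

```python
from typing import List, NamedTuple, Tuple


class FlowcellPosition(NamedTuple):
    """Physical location of a cluster, parsed from a read name."""
    lane: str  # everything before the tile (instrument, run, flowcell, lane)
    tile: int
    x: int
    y: int


def parse_position(read_name: str) -> FlowcellPosition:
    """Parse a read name assumed to end in ':tile:x:y'."""
    fields = read_name.split(":")
    if len(fields) < 5:
        raise ValueError(f"unexpected read name layout: {read_name!r}")
    tile, x, y = (int(value) for value in fields[-3:])
    return FlowcellPosition(":".join(fields[:-3]), tile, x, y)


def is_optical_duplicate(name_a: str, name_b: str, pixel_distance: int = 100) -> bool:
    """True if both clusters share lane and tile and lie within pixel_distance."""
    a, b = parse_position(name_a), parse_position(name_b)
    return (
        a.lane == b.lane
        and a.tile == b.tile
        and abs(a.x - b.x) <= pixel_distance
        and abs(a.y - b.y) <= pixel_distance
    )


def split_duplicates(
    kept_read: str, duplicate_reads: List[str], pixel_distance: int = 100
) -> Tuple[List[str], List[str]]:
    """Split duplicates of kept_read into (optical, pcr) lists of read names."""
    optical, pcr = [], []
    for name in duplicate_reads:
        if is_optical_duplicate(kept_read, name, pixel_distance):
            optical.append(name)
        else:
            pcr.append(name)
    return optical, pcr
```

Given the name of the kept read and the names of the reads marked as its duplicates, `split_duplicates` returns two lists that could then be filtered separately. A real tool would also need to handle read groups and read names that do not follow the assumed layout.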
from picard.
FYI most of this is covered in the picard FAQ:
http://broadinstitute.github.io/picard/faq.html
-t
from picard.
Indeed, thanks Tim.
From the FAQ:
Q: How does MarkDuplicates work?
A: Essentially what it does (for pairs; single-end data is also handled) is
to find the 5' coordinates and mapping orientations of each read pair. When
doing this it takes into account all clipping that has taken place as well
as any gaps or jumps in the alignment. You can thus think of it as
determining "if all the bases from the read were aligned, where would the
5' most base have been aligned". It then matches all read pairs that have
identical 5' coordinates and orientations and marks as duplicates all but
the "best" pair. "Best" is defined as the read pair having the highest sum
of base qualities, counting only bases with Q >= 15.
If your reads have been divided into separate BAMs by chromosome,
inter-chromosomal pairs will not be identified, but MarkDuplicates will not
fail due to inability to find the mate pair for a read.
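The two ideas in the FAQ answer above (the "where would the 5' most base have been aligned" coordinate and the "best" pair score) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Picard's code: `unclipped_five_prime` and `pair_score` are hypothetical names, positions are 1-based as in SAM, and the CIGAR string is assumed to be well formed.

```python
import re
from typing import Iterable

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operators that advance along the reference
CLIPS = set("SH")             # soft and hard clips


def unclipped_five_prime(pos: int, cigar: str, reverse: bool) -> int:
    """Reference coordinate of the read's 5'-most base if clips were aligned.

    pos is the 1-based leftmost aligned position (the SAM POS field).
    """
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if not reverse:
        # Forward strand: the 5' end is on the left, so step back over
        # any leading clipped bases.
        leading = 0
        for length, op in ops:
            if op not in CLIPS:
                break
            leading += length
        return pos - leading
    # Reverse strand: the 5' end is on the right. Gaps and skipped regions
    # (D and N) count toward the reference span, then trailing clips are added.
    ref_len = sum(length for length, op in ops if op in REF_CONSUMING)
    trailing = 0
    for length, op in reversed(ops):
        if op not in CLIPS:
            break
        trailing += length
    return pos + ref_len - 1 + trailing


def pair_score(
    quals_read1: Iterable[int], quals_read2: Iterable[int], min_quality: int = 15
) -> int:
    """Sum of base qualities over both reads, counting only Q >= min_quality."""
    return sum(q for q in quals_read1 if q >= min_quality) + sum(
        q for q in quals_read2 if q >= min_quality
    )
```

For example, a forward read at position 105 with CIGAR `5S45M` and a forward read at position 100 with CIGAR `50M` both yield 100, so under this sketch they would share a 5' coordinate even though their start positions and CIGARs differ. Within a group of pairs that share coordinates and orientations, the pair with the highest `pair_score` would be the one left unmarked.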
Geraldine A. Van der Auwera, Ph.D.
Bioinformatics Scientist II
GATK Support & Outreach
Broad Institute
from picard.
Will refer GATK users to the Picard FAQ and correct any GATK docs with erroneous info.
from picard.
Do you know if it also works with the RNA N's?
from picard.
Hmm. I would expect it to work, but no guarantees.
Geraldine A. Van der Auwera, Ph.D.
Bioinformatics Scientist II
GATK Support & Outreach
Broad Institute
from picard.
If the data is paired-end, it doesn't look at the bases, only at the alignment information.
from picard.