
Comments (8)

kockan commented on August 16, 2024

Turns out there are some tests for this case already (and they seem to pass):

https://github.com/broadinstitute/picard/blob/master/testdata/picard/util/BedToIntervalListTest/zero_base_interval.bed
https://github.com/broadinstitute/picard/blob/master/testdata/picard/util/BedToIntervalListTest/zero_base_interval.bed.interval_list
https://github.com/broadinstitute/picard/blob/master/testdata/picard/util/BedToIntervalListTest/zero_length_interval_at_first_position_in_contig.bed
https://github.com/broadinstitute/picard/blob/master/testdata/picard/util/BedToIntervalListTest/zero_length_interval_at_first_position_in_contig.bed.interval_list

@rickymagner Do you believe it would be better (and more correct in a sense) if the output interval lists for the test cases above were empty instead (besides the header)?

from picard.

yfarjoun commented on August 16, 2024

Interval lists should be able to represent empty intervals. Because they use 1-based inclusive coordinates, as opposed to 0-based half-open ones (like BED), the "start" coordinate has to be shifted by one while "end" stays as it is. For an empty interval that leaves "start" one greater than "end". The result isn't illegal...
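To make the coordinate change concrete, here is a minimal Python sketch. It is not Picard's code, and the function names are made up for illustration. It only shows the arithmetic described above: BED "start" gets +1 and "end" is unchanged, so a zero-length BED record ends up with a start one past its end.

```python
from typing import Tuple


def bed_to_interval_list_coords(bed_start: int, bed_end: int) -> Tuple[int, int]:
    """Convert 0-based half-open BED coordinates to 1-based inclusive ones.

    Hypothetical helper for illustration; not part of Picard or htsjdk.
    """
    # Only the start shifts: the half-open end of BED already equals
    # the inclusive end in 1-based coordinates.
    return bed_start + 1, bed_end


def interval_length(start: int, end: int) -> int:
    """Length of a 1-based inclusive interval."""
    return end - start + 1


# A zero-length BED record at position 100 (start == end).
start, end = bed_to_interval_list_coords(100, 100)
print(start, end, interval_length(start, end))  # prints "101 100 0"
```

A tool that insists on start <= end will reject the (101, 100) pair even though its length of 0 is self-consistent.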

Which tools fail on empty intervals in an interval list?


rickymagner commented on August 16, 2024

Perhaps the problem isn't this tool, then, but how GATK uses the -L flag. The example I saw this fail on was gatk CountVariants -V example.vcf -L test.interval_list, but it seems to fail on test.bed as well. The error is:

A USER ERROR has occurred: Badly formed genome unclippedLoc: Parameters to GenomeLocParser are incorrect:The stop position 100 is less than start 101 in contig chr1

Or maybe it's just localized to CountVariants? If we're confident most tools should be able to handle this edge case properly, I'd be OK treating this as a CountVariants bug rather than BedToIntervalList.


lbergelson commented on August 16, 2024

GATK interval processing will reject empty intervals by design.

The last time we looked into this, all the empty intervals we found in bed files were errors in the bed file, not meaningful data. I know there's an ongoing "but what about insertions!" argument, but I thought we had agreed that insertions were going to be represented like in vcf.

If you use empty intervals, htsjdk will process them inconsistently and your results will be wrong, because the math for empty intervals is inconsistent and no one has ever cared enough to fix it.

@yfarjoun Is there a new use case that actually needs empty intervals?


lbergelson commented on August 16, 2024

See samtools/htsjdk#1320 for a list of ongoing issues.


lbergelson commented on August 16, 2024

Let's add a --remove-zero-length flag to these tools.


droazen commented on August 16, 2024

Consensus is that we should just add a --keep-zero-length-intervals flag to the tool, and skip them by default.
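As a rough sketch of that behaviour (in Python, not the actual Picard change; the function and its keep_zero_length parameter are hypothetical stand-ins for the proposed flag), zero-length BED records are dropped unless the caller opts in:

```python
from typing import Iterable, Iterator, Tuple

# (contig, 0-based start, half-open end), as in a BED file.
BedRecord = Tuple[str, int, int]
# (contig, 1-based start, inclusive end), as in an interval list.
Interval = Tuple[str, int, int]


def convert_bed_records(
    records: Iterable[BedRecord],
    keep_zero_length: bool = False,  # hypothetical stand-in for the proposed flag
) -> Iterator[Interval]:
    """Convert BED records, skipping zero-length ones by default."""
    for contig, bed_start, bed_end in records:
        if bed_start == bed_end and not keep_zero_length:
            # Zero-length record: skipped unless explicitly kept.
            continue
        yield contig, bed_start + 1, bed_end


records = [("chr1", 100, 100), ("chr1", 0, 10)]
print(list(convert_bed_records(records)))
print(list(convert_bed_records(records, keep_zero_length=True)))
```

With the default, only the chr1 1-10 interval survives. With keep_zero_length=True, the empty interval is emitted as start 101, end 100, which is the same start/stop pair the GATK error above complains about.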


rickymagner commented on August 16, 2024

This discussion should now be resolved by #1928

