
Comments (6)

NicolasHug commented on June 16, 2024

Thanks a lot for the detailed feature request @MarcSzafraniec . This sounds reasonable overall, I shared more detailed thoughts below. LMK what you think!

> A functional version of SanitizeBoundingBoxes

Sounds good. Not sure whether we should allow labels and other keys here, or just bounding boxes; we'll see. Maybe we can get away with just returning the valid mask and letting the user subsample their input.

> A version of SanitizeBoundingBoxes that is aligned with other detection transforms: it should either take a BoundingBoxes TVTensor, or other transforms should be able to take in a dictionary that contains boxes, among others.

All of this is already supported. I.e. SanitizeBoundingBoxes() supports a single BoundingBoxes() object as input:

from torchvision.transforms import v2
from torchvision import tv_tensors
import torch

bbox = tv_tensors.BoundingBoxes(
    torch.tensor([[0, 0, 10, 10], [10, 10, 20, 20]]),
    canvas_size=(100, 100),
    format="XYXY",
)
v2.SanitizeBoundingBoxes(labels_getter=None)(bbox)

# BoundingBoxes([[ 0,  0, 10, 10],
#                [10, 10, 20, 20]], format=BoundingBoxFormat.XYXY, canvas_size=(100, 100))

and the other transforms do support dictionary and arbitrarily nested inputs, just like SanitizeBoundingBoxes: https://pytorch.org/vision/main/auto_examples/transforms/plot_transforms_getting_started.html#what-do-i-pass-as-input

You can definitely use SanitizeBoundingBoxes() within Compose() for example: https://pytorch.org/vision/main/auto_examples/transforms/plot_transforms_e2e.html#transforms

It seems like something wasn't clear in the docs here - I'd love for you to let us know what could be improved on that front.

> Finally, an important drawback of SanitizeBoundingBoxes is that it subsets the boxes and the labels together, but not other detection features such as area or iscrowd. This is error prone, and this transform should at least be able to accommodate the standard COCO format.

In general it's going to be intractable to subset arbitrary inputs (and potentially inconvenient, since it would force those inputs to be in a specific expected format). But perhaps it would be enough to let the functional return the subsampling mask (as suggested above), and let the user do the indexing themselves on their desired inputs (area, iscrowd, etc.)?
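To make the mask idea concrete, here is a rough pure-Python sketch (illustration only, not torchvision code; the XYXY layout and the min_size / inside-canvas criteria are assumptions):

```python
def valid_box_mask(boxes, canvas_size, min_size=1):
    """Return a boolean mask over XYXY boxes: True where the box is at least
    min_size wide and tall and lies fully inside the canvas."""
    height, width = canvas_size
    mask = []
    for x1, y1, x2, y2 in boxes:
        big_enough = (x2 - x1) >= min_size and (y2 - y1) >= min_size
        inside = x1 >= 0 and y1 >= 0 and x2 <= width and y2 <= height
        mask.append(big_enough and inside)
    return mask

boxes = [(0, 0, 10, 10), (50, 50, 50, 50)]
area = [100, 0]
mask = valid_box_mask(boxes, canvas_size=(100, 100))
# the user applies the same mask to whatever extra fields they carry:
boxes = [b for b, ok in zip(boxes, mask) if ok]
area = [a for a, ok in zip(area, mask) if ok]
```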

from vision.

MarcSzafraniec commented on June 16, 2024

Thanks for answering so quickly! Well, indeed we can pass either a BoundingBoxes object to SanitizeBoundingBoxes, or a dict to other transforms, but then the mask information is lost.

Maybe a solution would be to have a labels_getter that could in fact be a list of dictionary keys to which the mask is applied? It would be simple to implement and probably generic enough.


NicolasHug commented on June 16, 2024

Summarizing an offline chat with @MarcSzafraniec :

A. Allow users to subsample arbitrary stuff

We need to allow users to subsample arbitrary stuff, not just the labels (e.g. "iscrowd" info, etc). This can be done in different ways, with the constraint that we don't want to enforce a specific input structure as this would go against a fundamental design principle of v2 (e.g. we don't want to enforce specific keys or even a dict structure).

One suggested solution would be to let labels_getter return a tuple of tensors. Each tensor in that tuple gets subsampled, just like the label tensor. The tuple entries technically don't need to be tensors, they just have to be anything that supports [bool_mask] indexing.

B. Having a functional

We've left out the functional implementation of SanitizeBoundingBoxes, but it can still be useful. This functional should be able to subsample the BoundingBoxes objects, but also the labels and arbitrary tensors. I see 2 main ways of doing this for now (there are probably others), both of which deviate from what we typically call a functional (then again, SanitizeBoundingBoxes is an outlier itself).

Option 1

def sanitize_bounding_boxes(boxes, *args):
    # subsample boxes and everything in *args
    ...

# User code:
boxes, labels, iscrowd = sanitize_bounding_boxes(boxes, labels, iscrowd)

Option 2

def get_sanitize_bounding_boxes_mask(boxes):
    # just return the boolean subsampling mask
    return is_valid

# User code:
valid_mask = get_sanitize_bounding_boxes_mask(boxes)
boxes, labels, iscrowd = boxes[valid_mask], labels[valid_mask], iscrowd[valid_mask]

Note: with Option 1, a way to get the valid indices is boxes, valid_indices = sanitize_bounding_boxes(boxes, torch.arange(boxes.shape[0]))

EDIT: actually Option 2 is not as pleasant as I thought, because it forces users to call boxes = tv_tensors.wrap(...) to preserve the BoundingBoxes type.

CC @pmeier @vfdev-5 for your input on both design discussions here (A and B).


pmeier commented on June 16, 2024

A:

👍


B:

I like Option 1 best, but I don't think JIT supports *args, so this is another way it would deviate from our current implementation.

Plus, when you say "functional" here, do you mean an actual functional, i.e. one taking TVTensor input, or a kernel? In the former case, how would you handle rewrapping?

Without breaking any of our current paradigms, let me propose a third option:

def sanitize_bounding_boxes(boxes):
    # compute subsampling mask and apply to boxes
    return boxes, is_valid

That way people can use the mask if needed without extra computation, and don't have to subsample their boxes themselves. If we treat this function as a kernel rather than as a "functional", we also don't have an issue with the tuple return: for boxes we already have kernels that return a tuple, with the first item always being the boxes and the remaining items being metadata.
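A plain-Python sketch of that calling convention (hypothetical, for illustration only; the min_size criterion stands in for the full sanitization logic):

```python
def sanitize_bounding_boxes(boxes, min_size=1):
    # compute the validity mask, subsample the boxes with it, and return
    # both so callers can reuse the mask on other fields (area, iscrowd, ...)
    is_valid = [(x2 - x1) >= min_size and (y2 - y1) >= min_size
                for x1, y1, x2, y2 in boxes]
    boxes = [b for b, ok in zip(boxes, is_valid) if ok]
    return boxes, is_valid

boxes, is_valid = sanitize_bounding_boxes([(0, 0, 10, 10), (5, 5, 5, 5)])
iscrowd = [v for v, ok in zip([0, 1], is_valid) if ok]
```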


MarcSzafraniec commented on June 16, 2024

I have an additional comment: why the check if min_size < 1: raise ValueError(f"min_size must be >= 1, got {min_size}.")? Boxes could be relative, or just floating point, in which case it's possible to have a width or height < 1...


NicolasHug commented on June 16, 2024

@MarcSzafraniec that's because in torchvision the expected format for bounding box coordinates is absolute, not relative. The rest of the transforms also expect e.g. the XYXY format to be in absolute pixel coordinates.
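So if your boxes are relative, converting them to absolute pixel coordinates first is straightforward (a plain-Python sketch; the helper name is ours):

```python
def rel_to_abs_xyxy(box, canvas_size):
    # canvas_size is (height, width), matching torchvision's convention
    height, width = canvas_size
    x1, y1, x2, y2 = box
    return (x1 * width, y1 * height, x2 * width, y2 * height)

rel_to_abs_xyxy((0.1, 0.2, 0.5, 0.8), (100, 200))  # -> (20.0, 20.0, 100.0, 80.0)
```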

