🚀 The feature A functional version of SanitizeBoundingBoxes

Thanks a lot for the detailed feature request <a class="user-mention notranslate" data

Thanks a lot for the detailed feature request <a class="user-mention notr

Summarizing an offline chat with <a class="user-mention notranslate" data-hovercard-ty

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Cleaner SanitizeBoundingBoxes transform about vision HOT 6 CLOSED

MarcSzafraniec commented on September 26, 2024

Cleaner SanitizeBoundingBoxes transform

from vision.

Comments (6)

NicolasHug commented on September 26, 2024

Thanks a lot for the detailed feature request @MarcSzafraniec . This sounds reasonable overall, I shared more detailed thoughts below. LMK what you think!

A functional version of SanitizeBoundingBoxes

Sounds good. Not sure whether we should allow labels and other keys here, or just bouding boxes, we'll see. Maybe we can get away by just returning the valid mask and let the user subsample they input.

A version of SanitizeBoundingBoxes that is aligned with other detection transforms: it should either take a BoundingBoxes TVTensor, or other transforms should be able to take in a dictionary that contains boxes, among others.

All of this is already supported. I.e. SanitizeBoundingBoxes() supports a single BoundingBoxes() object as input:

from torchvision.transforms import v2
from torchvision import tv_tensors
import torch

bbox = tv_tensors.BoundingBoxes(torch.tensor([[0, 0, 10, 10], [10, 10, 20, 20]]), canvas_size=(100, 100), format="XYXY")
v2.SanitizeBoundingBoxes(labels_getter=None)(bbox)

# BoundingBoxes([[ 0,  0, 10, 10],
#                [10, 10, 20, 20]], format=BoundingBoxFormat.XYXY, canvas_size=(100, 100))

and the other transforms do support dictionary and arbitrarily nested inputs, just like SanitizeBoundingBoxes: https://pytorch.org/vision/main/auto_examples/transforms/plot_transforms_getting_started.html#what-do-i-pass-as-input

You can definitely use SanitizeBoundingBoxes() within Compose() for example: https://pytorch.org/vision/main/auto_examples/transforms/plot_transforms_e2e.html#transforms

It seems like something wasn't clear in the docs here - I'd love for you to let us know what could be improved on that front.

Finally, an important drawback of SanitizeBoundingBoxes is that it subsets the boxes and the labels together, but not other detection features such as area or iscrowd. This is error prone, and this transform should at least be able to accommodate the standard COCO format.

In general it's going to be untracktable to subset arbitrary inputs (and potentially inconvenient, since it would force those inputs to be in a specific expected format). But perhaps it would be enough let the functional return the subsampling maks (as suggested above), and let the user do the indexing themselves on their desired ipnuts (area, iscrowd, etc.) ?

from vision.

MarcSzafraniec commented on September 26, 2024

Thanks a lot for the detailed feature request @MarcSzafraniec . This sounds reasonable overall, I shared more detailed thoughts below. LMK what you think!

A functional version of SanitizeBoundingBoxes

Sounds good. Not sure whether we should allow labels and other keys here, or just bouding boxes, we'll see. Maybe we can get away by just returning the valid mask and let the user subsample they input.

A version of SanitizeBoundingBoxes that is aligned with other detection transforms: it should either take a BoundingBoxes TVTensor, or other transforms should be able to take in a dictionary that contains boxes, among others.

All of this is already supported. I.e. SanitizeBoundingBoxes() supports a single BoundingBoxes() object as input:
from torchvision.transforms import v2
from torchvision import tv_tensors
import torch

bbox = tv_tensors.BoundingBoxes(torch.tensor([[0, 0, 10, 10], [10, 10, 20, 20]]), canvas_size=(100, 100), format="XYXY")
v2.SanitizeBoundingBoxes(labels_getter=None)(bbox)

# BoundingBoxes([[ 0,  0, 10, 10],
#                [10, 10, 20, 20]], format=BoundingBoxFormat.XYXY, canvas_size=(100, 100))
and the other transforms do support dictionary and arbitrarily nested inputs, just like SanitizeBoundingBoxes: https://pytorch.org/vision/main/auto_examples/transforms/plot_transforms_getting_started.html#what-do-i-pass-as-input

You can definitely use SanitizeBoundingBoxes() within Compose() for example: https://pytorch.org/vision/main/auto_examples/transforms/plot_transforms_e2e.html#transforms

It seems like something wasn't clear in the docs here - I'd love for you to let us know what could be improved on that front.

Finally, an important drawback of SanitizeBoundingBoxes is that it subsets the boxes and the labels together, but not other detection features such as area or iscrowd. This is error prone, and this transform should at least be able to accommodate the standard COCO format.

In general it's going to be untracktable to subset arbitrary inputs (and potentially inconvenient, since it would force those inputs to be in a specific expected format). But perhaps it would be enough let the functional return the subsampling maks (as suggested above), and let the user do the indexing themselves on their desired ipnuts (area, iscrowd, etc.) ?

Thanks for answering so quickly ! Well, indeed we can pass either a BoundingBoxes object to SanitizeBoundingBoxes, or a dict to other transforms, but then the mask information is lost.

Maybe a solution would be to have a "labels_getter" that could be in fact a list of dictionary keys for which to apply the mask ? It would be simple to implement and probably generic enough.

from vision.

NicolasHug commented on September 26, 2024

Summarizing an offline chat with @MarcSzafraniec :

A. Allow users to subsample arbitrary stuff

We need to allow users to subsample arbitrary stuff, not just the labels (e.g. "iscrowd" info, etc). This can be done in different ways, with the constraint that we don't want to enforce a specific input structure as this would go against a fundamental design principle of v2 (e.g. we don't want to enforce specific keys or even a dict structure).

One suggested solution would be to let labels_getter return a tuple of tensors. Each tensor in that tuple gets subsampled, just like the label tensor. The tuple entries technically don't need to be tensors, they just have to be anything that supports [bool_mask] indexing.

B. Having a functional

We've left out the functional implementation of SanitizeBoundingBoxes, but it can still be useful. This functional should be able to subsample the BoundingBoxes objects but also the labels and arbitrary tensors. I see 2 main ways of doing this for now (there's probably others), both of which deviate from what we typically call a functional (then again, this SanitizeBoundingBoxes is an outlier itself)

Option 1

def sanitize_bounding_boxes(boxes, *args):
    # subsample boxes and everything in *args

# User code:
boxes, labels, iscrowd = sanitize_bounding_boxes(boxes, labels, is_crowd)

Option 2

def get_sanitize_bouding_boxes_mask(boxes):
    # just return the boolean subsampling mask
	return is_valid

# User code:
valid_mask = get_sanitize_bouding_boxes_mask(boxes)
boxes, labels, iscrowd = boxes[valid_mask], labels[valid_mask], iscrowd[valid_mask]

Note: a way to get the valid_mask with option 1 is to pass boxes, valid_indices = sanitize_bounding_boxes(boxes, torch.arange(boxes.shape[0]))

EDIT: actually Option2 is not as pleasant as I thought because it forces users to call boxes = tv_tensors.wrap(...) to preserve the BoundingBoxes type.

CC @pmeier @vfdev-5 for your input on both design discussions here (A and B).

from vision.

pmeier commented on September 26, 2024

👍

I like option 1 best, but I don't think JIT supports *args. So this is another way this would deviate from our current implementation.

Plus, when you say "functional" here, do you mean an actual functional, i.e. TVTensor input or a kernel? In the former case how would you handle rewrapping?

Without breaking any of our current paradigms, let me propose a third option:

def sanitize_bounding_boxes(boxes):
    # compute subsampling mask and apply to boxes
    return boxes, is_valid

That way people can use the mask if needed without extra computation, and don't have to subsample their boxes themselves. If we treat this function as kernel rather than as "functional", we also don't have an issue with the tuple return. For boxes we already have kernels that return a tuple, with the first item always being the boxes and the remaining parts being metadata.

from vision.

MarcSzafraniec commented on September 26, 2024

I have an additional comment: why this check if min_size < 1: raise ValueError(f"min_size must be >= 1, got {min_size}.") ? Boxes could be relative, or even just floating point, in that case its possible to have a width or height < 1 ...

from vision.

NicolasHug commented on September 26, 2024

@MarcSzafraniec that's because in torchvision the expected format for the bounding boxes coordinates is absolute, not relative. The rest of the transforms also expect e.g. the XYXY format to be in absolute pixel coordinates.

from vision.

Cleaner SanitizeBoundingBoxes transform about vision HOT 6 CLOSED

Comments (6)

A. Allow users to subsample arbitrary stuff

B. Having a functional

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent