Comments (6)
Thanks a lot for the detailed feature request @MarcSzafraniec . This sounds reasonable overall, I shared more detailed thoughts below. LMK what you think!
A functional version of SanitizeBoundingBoxes
Sounds good. Not sure whether we should allow labels
and other keys here, or just bouding boxes, we'll see. Maybe we can get away by just returning the valid
mask and let the user subsample they input.
A version of SanitizeBoundingBoxes that is aligned with other detection transforms: it should either take a BoundingBoxes TVTensor, or other transforms should be able to take in a dictionary that contains boxes, among others.
All of this is already supported. I.e. SanitizeBoundingBoxes()
supports a single BoundingBoxes()
object as input:
from torchvision.transforms import v2
from torchvision import tv_tensors
import torch
bbox = tv_tensors.BoundingBoxes(torch.tensor([[0, 0, 10, 10], [10, 10, 20, 20]]), canvas_size=(100, 100), format="XYXY")
v2.SanitizeBoundingBoxes(labels_getter=None)(bbox)
# BoundingBoxes([[ 0, 0, 10, 10],
# [10, 10, 20, 20]], format=BoundingBoxFormat.XYXY, canvas_size=(100, 100))
and the other transforms do support dictionary and arbitrarily nested inputs, just like SanitizeBoundingBoxes
: https://pytorch.org/vision/main/auto_examples/transforms/plot_transforms_getting_started.html#what-do-i-pass-as-input
You can definitely use SanitizeBoundingBoxes()
within Compose()
for example: https://pytorch.org/vision/main/auto_examples/transforms/plot_transforms_e2e.html#transforms
It seems like something wasn't clear in the docs here - I'd love for you to let us know what could be improved on that front.
Finally, an important drawback of SanitizeBoundingBoxes is that it subsets the boxes and the labels together, but not other detection features such as area or iscrowd. This is error prone, and this transform should at least be able to accommodate the standard COCO format.
In general it's going to be untracktable to subset arbitrary inputs (and potentially inconvenient, since it would force those inputs to be in a specific expected format). But perhaps it would be enough let the functional return the subsampling maks (as suggested above), and let the user do the indexing themselves on their desired ipnuts (area, iscrowd, etc.) ?
from vision.
Thanks a lot for the detailed feature request @MarcSzafraniec . This sounds reasonable overall, I shared more detailed thoughts below. LMK what you think!
A functional version of SanitizeBoundingBoxes
Sounds good. Not sure whether we should allow
labels
and other keys here, or just bouding boxes, we'll see. Maybe we can get away by just returning thevalid
mask and let the user subsample they input.A version of SanitizeBoundingBoxes that is aligned with other detection transforms: it should either take a BoundingBoxes TVTensor, or other transforms should be able to take in a dictionary that contains boxes, among others.
All of this is already supported. I.e.
SanitizeBoundingBoxes()
supports a singleBoundingBoxes()
object as input:from torchvision.transforms import v2 from torchvision import tv_tensors import torch bbox = tv_tensors.BoundingBoxes(torch.tensor([[0, 0, 10, 10], [10, 10, 20, 20]]), canvas_size=(100, 100), format="XYXY") v2.SanitizeBoundingBoxes(labels_getter=None)(bbox) # BoundingBoxes([[ 0, 0, 10, 10], # [10, 10, 20, 20]], format=BoundingBoxFormat.XYXY, canvas_size=(100, 100))and the other transforms do support dictionary and arbitrarily nested inputs, just like
SanitizeBoundingBoxes
: https://pytorch.org/vision/main/auto_examples/transforms/plot_transforms_getting_started.html#what-do-i-pass-as-inputYou can definitely use
SanitizeBoundingBoxes()
withinCompose()
for example: https://pytorch.org/vision/main/auto_examples/transforms/plot_transforms_e2e.html#transformsIt seems like something wasn't clear in the docs here - I'd love for you to let us know what could be improved on that front.
Finally, an important drawback of SanitizeBoundingBoxes is that it subsets the boxes and the labels together, but not other detection features such as area or iscrowd. This is error prone, and this transform should at least be able to accommodate the standard COCO format.
In general it's going to be untracktable to subset arbitrary inputs (and potentially inconvenient, since it would force those inputs to be in a specific expected format). But perhaps it would be enough let the functional return the subsampling maks (as suggested above), and let the user do the indexing themselves on their desired ipnuts (area, iscrowd, etc.) ?
Thanks for answering so quickly ! Well, indeed we can pass either a BoundingBoxes
object to SanitizeBoundingBoxes
, or a dict to other transforms, but then the mask
information is lost.
Maybe a solution would be to have a "labels_getter" that could be in fact a list of dictionary keys for which to apply the mask ? It would be simple to implement and probably generic enough.
from vision.
Summarizing an offline chat with @MarcSzafraniec :
A. Allow users to subsample arbitrary stuff
We need to allow users to subsample arbitrary stuff, not just the labels (e.g. "iscrowd" info, etc). This can be done in different ways, with the constraint that we don't want to enforce a specific input structure as this would go against a fundamental design principle of v2 (e.g. we don't want to enforce specific keys or even a dict structure).
One suggested solution would be to let labels_getter
return a tuple of tensors. Each tensor in that tuple gets subsampled, just like the label tensor. The tuple entries technically don't need to be tensors, they just have to be anything that supports [bool_mask]
indexing.
B. Having a functional
We've left out the functional implementation of SanitizeBoundingBoxes
, but it can still be useful. This functional should be able to subsample the BoundingBoxes
objects but also the labels and arbitrary tensors. I see 2 main ways of doing this for now (there's probably others), both of which deviate from what we typically call a functional
(then again, this SanitizeBoundingBoxes
is an outlier itself)
Option 1
def sanitize_bounding_boxes(boxes, *args):
# subsample boxes and everything in *args
# User code:
boxes, labels, iscrowd = sanitize_bounding_boxes(boxes, labels, is_crowd)
Option 2
def get_sanitize_bouding_boxes_mask(boxes):
# just return the boolean subsampling mask
return is_valid
# User code:
valid_mask = get_sanitize_bouding_boxes_mask(boxes)
boxes, labels, iscrowd = boxes[valid_mask], labels[valid_mask], iscrowd[valid_mask]
Note: a way to get the valid_mask
with option 1 is to pass boxes, valid_indices = sanitize_bounding_boxes(boxes, torch.arange(boxes.shape[0]))
EDIT: actually Option2 is not as pleasant as I thought because it forces users to call boxes = tv_tensors.wrap(...)
to preserve the BoundingBoxes type.
CC @pmeier @vfdev-5 for your input on both design discussions here (A and B).
from vision.
A:
👍
B:
I like option 1 best, but I don't think JIT supports *args
. So this is another way this would deviate from our current implementation.
Plus, when you say "functional" here, do you mean an actual functional, i.e. TVTensor
input or a kernel? In the former case how would you handle rewrapping?
Without breaking any of our current paradigms, let me propose a third option:
def sanitize_bounding_boxes(boxes):
# compute subsampling mask and apply to boxes
return boxes, is_valid
That way people can use the mask if needed without extra computation, and don't have to subsample their boxes themselves. If we treat this function as kernel rather than as "functional", we also don't have an issue with the tuple return. For boxes we already have kernels that return a tuple, with the first item always being the boxes and the remaining parts being metadata.
from vision.
I have an additional comment: why this check if min_size < 1: raise ValueError(f"min_size must be >= 1, got {min_size}.")
? Boxes could be relative, or even just floating point, in that case its possible to have a width or height < 1 ...
from vision.
@MarcSzafraniec that's because in torchvision the expected format for the bounding boxes coordinates is absolute, not relative. The rest of the transforms also expect e.g. the XYXY format to be in absolute pixel coordinates.
from vision.
Related Issues (20)
- Add vision-language models HOT 1
- Add mobilenetv4 support and pretrained models? HOT 5
- Allow passing a file-like object to torchvision.io.video.read_video HOT 6
- cutmix alpha argument in references/classification/transforms.py HOT 1
- torchvision manywheel py 3.11, cuda failure HOT 6
- MacOS test job are failing and install `torch-2.2.0.dev20231010`
- Let `v2.functional.gaussian_blur` backprop through `sigma` parameter HOT 3
- TestLinearTransform.test_transform[cpu-dtype1-make_image_tensor] is flaky
- M1 jobs are all failing HOT 2
- mean/variance naming consistency
- bad_alloc when using Conv2d after torchvision.io.read_video
- Support NumPy 2.0 HOT 3
- Torchvision nightly not being published for Linux, Python 3.11. CUDA 11.8 HOT 2
- SBDataset downloads fails because train_noval.txt is no more available at the specified url HOT 4
- Put back MPS builds HOT 6
- M1 conda-py3_11-cpu failing HOT 1
- Batched (random) augmentations HOT 3
- nconsistency between torchvision.datasets.cifar.CIFAR10.data return and torchvision.datasets.cifar.CIFAR10.__getitem__() HOT 3
- Efficientnet_b3 pretrained model load with invalid hash value HOT 1
- Swin Transformer stochastic_depth_prob arguments HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from vision.