Visual In-Context Prompting

๐Ÿ‡ [Read our arXiv Paper] ย  ๐ŸŽ [Try our Demo]

In this work, we introduce DINOv, a Visual In-Context Prompting framework for referring and generic segmentation tasks.

For visualization and demos, we also recommend trying the T-Rex demo, another visual prompting tool from our team with properties similar to DINOv.

teaser

๐Ÿ› ๏ธ Installation

pip3 install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu113
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
pip install git+https://github.com/cocodataset/panopticapi.git
git clone https://github.com/UX-Decoder/DINOv
cd DINOv
python -m pip install -r requirements.txt
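
To verify the environment, a quick sanity check (a minimal sketch; it only assumes the standard torch, torchvision, and detectron2 import names) is:

import torch, torchvision
import detectron2

# the install above targets torch 1.13.1 / torchvision 0.14.1 with a CUDA build
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
print("detectron2:", detectron2.__version__)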

👉 Launch a demo for visual in-context prompting

python demo_openset.py --ckpt /path/to/swinL/ckpt

Open-set segmentation

generic_seg_vis

Panoptic segmentation

panoptic_vis

👉 Related projects:

  • Semantic-SAM: We build on its multi-granularity interactive segmentation to extract proposals.
  • Mask DINO: We build our model on Mask DINO, a unified detection and segmentation framework.
  • SEEM: Segment images using a wide range of user prompts.

🦄 Getting Started

🕌 Data preparation

We jointly train on COCO and SA-1B data. Please refer to the instructions for preparing SA-1B data and preparing COCO data.

For evaluation, you need to prepare:

  • ADE20K for open-set segmentation evaluation.
  • DAVIS2017 for referring segmentation (video object segmentation).

🌋 Model Zoo

The currently released checkpoints are trained with SA-1B and COCO data.

Name | Training Dataset | Backbone | PQ (COCO) | PQ (ADE) | Download
DINOv (config) | SA-1B, COCO | SwinT | 49.0 | 19.4 | model
DINOv (config) | SA-1B, COCO | SwinL | 57.7 | 23.2 | model

🌻 Evaluation

We perform detection evaluation on COCO val2017. $n is the number of GPUs you use.

First, process visual prompt embeddings for inference: we compute and store the instance prompt embeddings for all instances in the validation set (you can also use the training set, but processing takes much longer). We then run inference by randomly selecting some of these visual prompts as in-context examples.
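
Conceptually, turning the stored embeddings into category queries looks roughly like the sketch below (illustration only; the function and variable names are hypothetical, not the repo's API, and averaging is just one simple pooling choice):

import random
import torch

def build_category_queries(stored_embeddings, num_examples=16):
    """Randomly sample in-context examples per category and pool them.

    stored_embeddings: dict of category id -> tensor (N_i, D), the per-instance
    visual prompt embeddings precomputed over the validation set.
    """
    queries = {}
    for cat_id, embeds in stored_embeddings.items():
        k = min(num_examples, embeds.shape[0])
        idx = random.sample(range(embeds.shape[0]), k)
        # pool the sampled in-context examples into one query per category
        queries[cat_id] = embeds[idx].mean(dim=0)
    return queries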

Evaluate Open-set detection and segmentation

  • Inference script to compute and store visual prompts
python train_net.py --eval_only --resume --eval_get_content_features --num-gpus 8 --config-file /path/to/configs COCO.TEST.BATCH_SIZE_TOTAL=8 MODEL.WEIGHTS=/path/to/weights OUTPUT_DIR=/path/to/outputs
  • Inference script for open-set detection on COCO with visual prompts
python train_net.py --eval_only --resume --eval_visual_openset --num-gpus 8 --config-file /path/to/configs COCO.TEST.BATCH_SIZE_TOTAL=8 MODEL.WEIGHTS=/path/to/weights MODEL.DECODER.INFERENCE_EXAMPLE=16 OUTPUT_DIR=/path/to/outputs
  • The configs to use are configs/dinov_sam_coco_train.yaml for SwinT and configs/dinov_sam_coco_swinl_train.yaml for SwinL.
  • For ADE20K data, use configs/dinov_sam_ade_eval.yaml and adjust the ADE evaluation batch size accordingly.
  • OUTPUT_DIR is the directory where the visual prompt embeddings are stored.
  • INFERENCE_EXAMPLE is the number of in-context examples used to represent a category (default: 16).

Evaluate Referring segmentation on VOS

We evaluate under the DAVIS 2017 semi-supervised setting; please refer to davis2017-evaluation for more details.

The first step is to compute and store the results on DAVIS2017. We implement a naive memory-aware approach with our in-context visual prompting.

python train_net.py --eval_track_prev --eval_only --resume --num-gpus 8 --config-file configs/dinov_sam_coco_train.yaml DAVIS.TEST.BATCH_SIZE_TOTAL=8 MODEL.WEIGHTS=/path/to/weights MODEL.DECODER.NMS_THRESHOLD=0.9 MODEL.DECODER.MAX_MEMORY_SIZE=9 OUTPUT_DIR=/path/to/outputs

The second step is to evaluate the semi-supervised results.

python evaluation_method.py --task semi-supervised --results_path /path/to/results --davis_path /path/to/davis/data
  • We use MAX_MEMORY_SIZE = 9 by default (1 current frame token and 8 previous memory tokens), as sketched below.
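
As a rough illustration of this naive memory-aware prompting (a sketch with hypothetical names, not the repo's implementation), the memory can be viewed as a fixed-size queue of prompt tokens:

from collections import deque
import torch

class PromptMemory:
    """Hold the current frame token plus up to 8 previous ones (MAX_MEMORY_SIZE = 9)."""

    def __init__(self, max_memory_size=9):
        # one slot is reserved for the current frame, the rest hold past frames
        self.prev_tokens = deque(maxlen=max_memory_size - 1)

    def build_prompt(self, current_token):
        # the visual prompt for this frame: current token + memorized tokens
        return torch.stack([current_token] + list(self.prev_tokens), dim=0)

    def update(self, current_token):
        # after segmenting the frame, push its token into the memory queue
        self.prev_tokens.append(current_token)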

โญ Training

We currently release the training code for SA-1B and COCO. It can also support Objects365 and other datasets with minimal modifications. $n is the number of GPUs you use. Before running the training code, you need to specify your SA-1B training data:

export DETECTRON2_DATASETS=/pth/to/cdataset  # path to COCO, ADE
export SAM_DATASET=/pth/to/sam_dataset  # path to SA-1B data
export SAM_DATASET_START=$start
export SAM_DATASET_END=$end

We convert the SA-1B data into 100 TSV files. start (int, 0-99) is the start index of your SA-1B data and end (int, 0-99) is the end index. You can refer to the Semantic-SAM JSON registration for SAM as a reference for data preparation.
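
As a rough sketch of how the start/end indices select shards (the file naming here is hypothetical; see the Semantic-SAM registration code for the actual pattern):

import os

sam_root = os.environ["SAM_DATASET"]
start = int(os.environ.get("SAM_DATASET_START", 0))  # int in [0, 99]
end = int(os.environ.get("SAM_DATASET_END", 99))     # int in [0, 99]

# hypothetical shard names; whether `end` is inclusive depends on the registration code
tsv_shards = [os.path.join(sam_root, f"sa1b_{i}.tsv") for i in range(start, end + 1)]
print(f"training on {len(tsv_shards)} SA-1B shards")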

We recommend using a total batch size of 64 for training, which provides enough positive and negative samples for contrastive learning.
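
The batch-size recommendation matters because an in-batch contrastive objective draws its negatives from the other prompt-query pairs in the batch; a minimal InfoNCE-style sketch (illustrative only, not the repo's exact loss):

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(prompt_embeds, query_embeds, temperature=0.07):
    """prompt_embeds, query_embeds: (B, D) tensors matched row-by-row.

    With batch size B, each positive pair gets B - 1 in-batch negatives,
    so a larger B (e.g. 64) yields more negatives per step.
    """
    p = F.normalize(prompt_embeds, dim=-1)
    q = F.normalize(query_embeds, dim=-1)
    logits = p @ q.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    return F.cross_entropy(logits, targets)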

For SwinT backbone

python train_net.py --resume --num-gpus 8 --config-file configs/dinov_sam_coco_train.yaml SAM.TRAIN.BATCH_SIZE_TOTAL=8 COCO.TRAIN.BATCH_SIZE_TOTAL=8

For SwinL backbone

python train_net.py --resume --num-gpus 8 --config-file configs/dinov_sam_coco_swinl_train.yaml SAM.TRAIN.BATCH_SIZE_TOTAL=8 COCO.TRAIN.BATCH_SIZE_TOTAL=8
  • Please use multi-node training, i.e., 64 GPUs for a total batch size of 64, where each GPU handles one SA-1B image and one COCO image (see the sketch after this list).
  • By default, we do not use COCO data for referring segmentation training. You can set MODEL.DECODER.COCO_TRACK=True to enable this task, which can improve referring segmentation performance on DAVIS.
  • We did not implement multi-image training for this task, which means you can only put one image per GPU for each data type (i.e., one SA-1B image and one COCO image).
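
For reference, a multi-node run ultimately goes through detectron2's launcher; the wiring looks roughly like the sketch below (assuming train_net.py follows detectron2's standard entry point; main here is a placeholder for the repo's actual training function):

from detectron2.engine import launch

def main(args=None):
    # placeholder for the training entry point defined in train_net.py
    pass

if __name__ == "__main__":
    launch(
        main,
        num_gpus_per_machine=8,
        num_machines=8,        # 8 nodes x 8 GPUs = 64 GPUs for a total batch size of 64
        machine_rank=0,        # set per node: 0..7
        dist_url="tcp://127.0.0.1:29500",  # replace with the master node's address
        args=(),
    )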

Model framework

framework query_formulation

Results

Open-set detection and segmentation

image

Video object segmentation

image

โœ’๏ธ Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{li2023visual,
  title={Visual In-Context Prompting},
  author={Li, Feng and Jiang, Qing and Zhang, Hao and Ren, Tianhe and Liu, Shilong and Zou, Xueyan and Xu, Huaizhe and Li, Hongyang and Li, Chunyuan and Yang, Jianwei and others},
  journal={arXiv preprint arXiv:2311.13601},
  year={2023}
}

