
dinov2's Introduction

🆕 [2023-10-26] Added DINOv2 backbones with registers, following Vision Transformers Need Registers.

DINOv2: Learning Robust Visual Features without Supervision

Meta AI Research, FAIR

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Patrick Labatut, Armand Joulin, Piotr Bojanowski

[Paper #1] [Paper #2] [Blog] [Demo] [BibTeX]

PyTorch implementation and pretrained models for DINOv2. For details, see the papers: DINOv2: Learning Robust Visual Features without Supervision and Vision Transformers Need Registers.

DINOv2 models produce high-performance visual features that can be directly employed with classifiers as simple as linear layers on a variety of computer vision tasks; these visual features are robust and perform well across domains without any requirement for fine-tuning. The models were pretrained on a dataset of 142 M images without using any labels or annotations.

video-reference+dinov2.mp4
Visualization of the first three principal components of the patch features of all frames, mapped to RGB values.

Pretrained models

| model | # of params | with registers | ImageNet k-NN | ImageNet linear | download |
|-------|-------------|----------------|---------------|-----------------|----------|
| ViT-S/14 distilled | 21 M | ❌ | 79.0% | 81.1% | backbone only |
| ViT-S/14 distilled | 21 M | ✅ | 79.1% | 80.9% | backbone only |
| ViT-B/14 distilled | 86 M | ❌ | 82.1% | 84.5% | backbone only |
| ViT-B/14 distilled | 86 M | ✅ | 82.0% | 84.6% | backbone only |
| ViT-L/14 distilled | 300 M | ❌ | 83.5% | 86.3% | backbone only |
| ViT-L/14 distilled | 300 M | ✅ | 83.8% | 86.7% | backbone only |
| ViT-g/14 | 1,100 M | ❌ | 83.5% | 86.5% | backbone only |
| ViT-g/14 | 1,100 M | ✅ | 83.7% | 87.1% | backbone only |

Pretrained backbones (via PyTorch Hub)

Please follow the instructions here to install PyTorch (the only required dependency for loading the model). Installing PyTorch with CUDA support is strongly recommended.

A corresponding model card is included in the repository.

import torch

# DINOv2
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')

# DINOv2 with registers
dinov2_vits14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
dinov2_vitb14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
dinov2_vitl14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_reg')
dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
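
Once loaded, a backbone can be used directly for feature extraction. A minimal sketch (the random tensor below only illustrates shapes; images are typically resized so that height and width are multiples of the patch size, 14):

import torch

dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vits14.eval()

dummy = torch.rand(1, 3, 224, 224)  # placeholder input; 224 = 16 x 14
with torch.no_grad():
    embedding = dinov2_vits14(dummy)  # global image embedding
print(embedding.shape)  # torch.Size([1, 384]) for ViT-S/14 (768 / 1024 / 1536 for B / L / g)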

Pretrained heads - Image classification

| backbone | with registers | download (ImageNet) |
|----------|----------------|---------------------|
| ViT-S/14 distilled | ❌ | linear head (1 layer, 4 layers) |
| ViT-S/14 distilled | ✅ | linear head (1 layer, 4 layers) |
| ViT-B/14 distilled | ❌ | linear head (1 layer, 4 layers) |
| ViT-B/14 distilled | ✅ | linear head (1 layer, 4 layers) |
| ViT-L/14 distilled | ❌ | linear head (1 layer, 4 layers) |
| ViT-L/14 distilled | ✅ | linear head (1 layer, 4 layers) |
| ViT-g/14 | ❌ | linear head (1 layer, 4 layers) |
| ViT-g/14 | ✅ | linear head (1 layer, 4 layers) |

The (full) classifier models can be loaded via PyTorch Hub:

import torch

# DINOv2
dinov2_vits14_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_lc')
dinov2_vitb14_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_lc')
dinov2_vitl14_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_lc')
dinov2_vitg14_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_lc')

# DINOv2 with registers
dinov2_vits14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg_lc')
dinov2_vitb14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg_lc')
dinov2_vitl14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_reg_lc')
dinov2_vitg14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg_lc')
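
A minimal usage sketch (the random tensor stands in for a properly preprocessed 224x224 image batch; the head predicts ImageNet-1k classes):

import torch

classifier = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_lc')
classifier.eval()

with torch.no_grad():
    logits = classifier(torch.rand(1, 3, 224, 224))
print(logits.shape)  # expected: ImageNet-1k logits, i.e. 1000 scores per image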

Pretrained heads - Depth estimation

| backbone | download head (NYUd) | download head (KITTI) |
|----------|----------------------|-----------------------|
| ViT-S/14 distilled | linear (1 layer, 4 layers), DPT | linear (1 layer, 4 layers), DPT |
| ViT-B/14 distilled | linear (1 layer, 4 layers), DPT | linear (1 layer, 4 layers), DPT |
| ViT-L/14 distilled | linear (1 layer, 4 layers), DPT | linear (1 layer, 4 layers), DPT |
| ViT-g/14 | linear (1 layer, 4 layers), DPT | linear (1 layer, 4 layers), DPT |

Pretrained heads - Semantic segmentation

| backbone | download model (ADE20K) | download head (ADE20K) | download head (VOC2012) |
|----------|-------------------------|------------------------|-------------------------|
| ViT-S/14 distilled | | linear, multi-scale | linear, multi-scale |
| ViT-B/14 distilled | | linear, multi-scale | linear, multi-scale |
| ViT-L/14 distilled | | linear, multi-scale | linear, multi-scale |
| ViT-g/14 | Mask2Former | linear, multi-scale | linear, multi-scale |

Installation

The training and evaluation code requires PyTorch 2.0 and xFormers 0.0.18, as well as a number of other third-party packages. Note that the code has only been tested with the specified versions and also expects a Linux environment. To set up all the required dependencies for training and evaluation, please follow the instructions below:

conda (Recommended) - Clone the repository and then create and activate a dinov2 conda environment using the provided environment definition:

conda env create -f conda.yaml
conda activate dinov2

pip - Clone the repository and then use the provided requirements.txt to install the dependencies:

pip install -r requirements.txt

For dense tasks (depth estimation and semantic segmentation), there are additional dependencies (specific versions of mmcv and mmsegmentation) which are captured in the extras dependency specifications:

conda (Recommended):

conda env create -f conda-extras.yaml
conda activate dinov2-extras

pip:

pip install -r requirements.txt -r requirements-extras.txt

Data preparation

ImageNet-1k

The root directory of the dataset should hold the following contents:

  • <ROOT>/test/ILSVRC2012_test_00000001.JPEG
  • <ROOT>/test/[..]
  • <ROOT>/test/ILSVRC2012_test_00100000.JPEG
  • <ROOT>/train/n01440764/n01440764_10026.JPEG
  • <ROOT>/train/[...]
  • <ROOT>/train/n15075141/n15075141_9993.JPEG
  • <ROOT>/val/n01440764/ILSVRC2012_val_00000293.JPEG
  • <ROOT>/val/[...]
  • <ROOT>/val/n15075141/ILSVRC2012_val_00049174.JPEG
  • <ROOT>/labels.txt

The provided dataset implementation expects a few additional metadata files to be present under the extra directory:

  • <EXTRA>/class-ids-TRAIN.npy
  • <EXTRA>/class-ids-VAL.npy
  • <EXTRA>/class-names-TRAIN.npy
  • <EXTRA>/class-names-VAL.npy
  • <EXTRA>/entries-TEST.npy
  • <EXTRA>/entries-TRAIN.npy
  • <EXTRA>/entries-VAL.npy

These metadata files can be generated (once) with the following lines of Python code:

from dinov2.data.datasets import ImageNet

for split in ImageNet.Split:
    dataset = ImageNet(split=split, root="<ROOT>", extra="<EXTRA>")
    dataset.dump_extra()

Note that the root and extra directories do not have to be distinct directories.

ImageNet-22k

Please adapt the dataset class to match your local setup.


⚠️ To execute the commands provided in the next sections for training and evaluation, the dinov2 package should be included in the Python module search path, i.e. simply prefix the command to run with PYTHONPATH=. (see the example below).
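
For example, launching the training entrypoint from the repository root would look as follows (with <ARGS> standing in for the arguments shown in the sections below):

PYTHONPATH=. python dinov2/run/train/train.py <ARGS>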

Training

Fast setup: training DINOv2 ViT-L/16 on ImageNet-1k

Run DINOv2 training on 4 A100-80GB nodes (32 GPUs) in a SLURM cluster environment with submitit:

python dinov2/run/train/train.py \
    --nodes 4 \
    --config-file dinov2/configs/train/vitl16_short.yaml \
    --output-dir <PATH/TO/OUTPUT/DIR> \
    train.dataset_path=ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>

Training time is approximately 1 day and the resulting checkpoint should reach 81.6% on k-NN eval and 82.9% on linear eval.

The training code saves the weights of the teacher in the eval folder every 12500 iterations for evaluation.

Long setup: training DINOv2 ViT-L/14 on ImageNet-22k

Run DINOv2 training on 12 A100-80GB nodes (96 GPUs) in a SLURM cluster environment with submitit:

python dinov2/run/train/train.py \
    --nodes 12 \
    --config-file dinov2/configs/train/vitl14.yaml \
    --output-dir <PATH/TO/OUTPUT/DIR> \
    train.dataset_path=ImageNet22k:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>

Training time is approximately 3.3 days and the resulting checkpoint should reach 82.0% on k-NN eval and 84.5% on linear eval.

The training code saves the weights of the teacher in the eval folder every 12500 iterations for evaluation.

Evaluation

The training code regularly saves the teacher weights. In order to evaluate the model, run the following evaluation on a single node:

k-NN classification on ImageNet-1k

python dinov2/run/eval/knn.py \
    --config-file <PATH/TO/OUTPUT/DIR>/config.yaml \
    --pretrained-weights <PATH/TO/OUTPUT/DIR>/eval/training_24999/teacher_checkpoint.pth \
    --output-dir <PATH/TO/OUTPUT/DIR>/eval/training_24999/knn \
    --train-dataset ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
    --val-dataset ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>

Logistic regression classification on ImageNet-1k

python dinov2/run/eval/log_regression.py \
    --config-file <PATH/TO/OUTPUT/DIR>/config.yaml \
    --pretrained-weights <PATH/TO/OUTPUT/DIR>/eval/training_24999/teacher_checkpoint.pth \
    --output-dir <PATH/TO/OUTPUT/DIR>/eval/training_24999/logreg \
    --train-dataset ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
    --val-dataset ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>

Linear classification with data augmentation on ImageNet-1k

python dinov2/run/eval/linear.py \
    --config-file <PATH/TO/OUTPUT/DIR>/config.yaml \
    --pretrained-weights <PATH/TO/OUTPUT/DIR>/eval/training_24999/teacher_checkpoint.pth \
    --output-dir <PATH/TO/OUTPUT/DIR>/eval/training_24999/linear \
    --train-dataset ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
    --val-dataset ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>

We release the weights from evaluating the different models:

| model | with registers | ImageNet top-1 | linear evaluation |
|-------|----------------|----------------|-------------------|
| ViT-S/14 distilled | ❌ | 81.1% | linear head weights |
| ViT-S/14 distilled | ✅ | 80.8% | linear head weights |
| ViT-B/14 distilled | ❌ | 84.5% | linear head weights |
| ViT-B/14 distilled | ✅ | 84.4% | linear head weights |
| ViT-L/14 distilled | ❌ | 86.3% | linear head weights |
| ViT-L/14 distilled | ✅ | 86.5% | linear head weights |
| ViT-g/14 | ❌ | 86.5% | linear head weights |
| ViT-g/14 | ✅ | 87.0% | linear head weights |

The performance of the provided pretrained model weights can be evaluated as follows on ImageNet-1k:

python dinov2/run/eval/linear.py \
    --config-file dinov2/configs/eval/vitg14_pretrain.yaml \
    --pretrained-weights https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth \
    --train-dataset ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
    --val-dataset ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>

Notebooks

A few notebooks are provided to help the community leverage the models and code:

  • Depth estimation - How to load and use the depth heads in combination with a matching backbone via mmcv
  • Semantic segmentation - How to load and use the segmentation heads in combination with a matching backbone via mmcv, and also how to load and use the Mask2Former-based segmentation model trained on ADE20K

License

DINOv2 code and model weights are released under the Apache License 2.0. See LICENSE for additional details.

Contributing

See contributing and the code of conduct.

Citing DINOv2

If you find this repository useful, please consider giving a star ⭐ and citation 🦖:

@misc{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
  journal={arXiv:2304.07193},
  year={2023}
}
@misc{darcet2023vitneedreg,
  title={Vision Transformers Need Registers},
  author={Darcet, Timothée and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr},
  journal={arXiv:2309.16588},
  year={2023}
}


dinov2's Issues

possible bug in _LinearClassifierWrapper of hubconf.py?

I think there might be a bug in the forward pass for the _LinearClassifierWrapper of hubconf.py, such that it throws an error when passing a batch of images through any of the linear classifier models (e.g., dinov2_vits14_lc).

Here's a quick test where the expected output size is 10x1000 (batch size x num_classes), but this throws an error:

import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
classifier = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_lc', layers=4, pretrained=True)
classifier.to(device)

classifier.eval()
with torch.no_grad():
    x = torch.rand(10,3,224,224).to(device)
    out = classifier(x)
print(out.shape)

==> RuntimeError: mat1 and mat2 shapes cannot be multiplied (296x384 and 1920x1000)

I believe the issue is caused by this bit of the forward function of the _LinearClassifierWrapper class:

...
            linear_input = torch.cat([
                x[0][1].squeeze(0),
                x[1][1].squeeze(0),
                x[2][1].squeeze(0),
                x[3][1].squeeze(0),
                x[3][0].squeeze(0).mean(0)
            ])
...            

This squeezes the batch dimension out, but I think it should instead be the following to retain the batch dimension and concatenate along the first dimension:

            linear_input = torch.cat([
                x[0][1],
                x[1][1],
                x[2][1],
                x[3][1],
                x[3][0].mean(1)
            ], dim=1)

After making this change, I'm able to reproduce the linear evaluation top-1 accuracy scores reported in the README.md.

Here's the patch I used to test this fix:

import torch

from functools import partial

device = 'cuda' if torch.cuda.is_available() else 'cpu'

classifier = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_lc', layers=4, pretrained=True)

def forward(self, x):
    if self.layers == 1:
        x = self.backbone.forward_features(x)
        cls_token = x["x_norm_clstoken"]
        patch_tokens = x["x_norm_patchtokens"]
        # keep the batch dimension and concatenate along dim=1
        linear_input = torch.cat([
            cls_token,
            patch_tokens.mean(1)
        ], dim=1)
    elif self.layers == 4:
        x = self.backbone.get_intermediate_layers(x, n=4, return_class_token=True)
        linear_input = torch.cat([
            x[0][1],
            x[1][1],
            x[2][1],
            x[3][1],
            x[3][0].mean(1)
        ], dim=1)
    else:
        assert False, f"Unsupported number of layers: {self.layers}"
    return self.linear_head(linear_input)

classifier.forward = partial(forward, classifier)
classifier.to(device)

Triton and xformers errors?

I am getting this error:
Operator wasn't built - see python -m xformers.info for more info
triton is not available
smallkF is not supported because:
max(query.shape[-1] != value.shape[-1]) > 32
unsupported embed per head: 64

I tried to build xformers from source and then tried to install triton as well, but it's not working. Has anyone faced this issue? I'm on Windows, by the way.

Semantic segmentation

I'm not able to find code for Semantic segmentation. In the paper it's written that:

 a linear layer is trained to predict class logits from a patch token. It is used to produce a low-resolution logit map (e.g. 32x32 for a model with patch size 16), which is then upsampled to full resolution (512x512) to obtain a segmentation map.

Does this mean a linear layer with 32*32 = 1024 output classes needs to be trained? What about n_last_blocks_list = [1, 4] and n_last_blocks = max(n_last_blocks_list)? Does that need to be changed to n_last_blocks_list = [1, 1] and n_last_blocks = max(n_last_blocks_list)?

Is there any sample code for semantic segmentation ?

[need advice]: Usage as backbone for CNN

Hi, amazing work you've done here!

I would like to use this as a backbone in a CNN-based autoencoder.
I have no experience with transformers (except for interacting with them in ChatGPT), so I would be pleased by any advice on how to replace my current backbone (ResNet-50v2, first 3 layers) with DINOv2!

[Feature Request] ViT Backbone on Segment Anything Dataset

Thanks a lot for your efforts to open-source the code and checkpoints!!!

Recently, another team from Meta also released a strong segmentation dataset/model: Segment Anything.

I think a DINOv2 ViT backbone finetuned on Segment Anything would be of great help for the CV community, since it would combine the best self-supervised and supervised models :). Do you guys have any plan to try it and release the backbone weights? For starters, ViT-Base/Large would be especially accessible for individual/academic researchers to use.

Thanks again for your time!

No module named 'dinov2'

I was having trouble running dinov2; the command I ran and the resulting error are shown below.

I would appreciate it if you could let me know if there is a solution.

(dinov2) abe@ganesa:~/kuma-ssl/dinov2$ python dinov2/run/train/train.py --nodes 1 --config-file dinov2/configs/train/vitl16_short.yaml
Traceback (most recent call last):
  File "/home/abe/kuma-ssl/dinov2/dinov2/run/train/train.py", line 11, in <module>
    from dinov2.logging import setup_logging
ModuleNotFoundError: No module named 'dinov2'
(dinov2) abe@ganesa:~/kuma-ssl/dinov2$ ls
CODE_OF_CONDUCT.md  LICENSE        README.md   demo.py  hubconf.py      requirements-dev.txt  scripts    setup.py
CONTRIBUTING.md     MODEL_CARD.md  conda.yaml  dinov2   pyproject.toml  requirements.txt      setup.cfg
(dinov2) abe@ganesa:~/kuma-ssl/dinov2$ cd dinov2/
(dinov2) abe@ganesa:~/kuma-ssl/dinov2/dinov2$ ls
__init__.py  __pycache__  configs  data  distributed  eval  fsdp  layers  logging  loss  models  run  train  utils

help

Sorry, I installed with --extra-index-url https://pypi.nvidia.com and cuml_cu11 generated this error:

pip install --extra-index-url https://pypi.nvidia.com cuml_cu11
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cuml_cu11
Using cached cuml_cu11-23.4.0.1681368248.tar.gz (6.8 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [16 lines of output]
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "C:\Users\Lenovo\AppData\Local\Temp\pip-install-lp6867xq\cuml-cu11_29831863f455450faf83190fcc7a7e3c\setup.py", line 137, in
raise RuntimeError(open("ERROR.txt", "r").read())
RuntimeError:
###########################################################################################
The package you are trying to install is only a placeholder project on PyPI.org repository.
This package is hosted on NVIDIA Python Package Index.

  This package can be installed as:
  ```
  $ pip install --extra-index-url https://pypi.nvidia.com cuml_cu11
  ```
  ###########################################################################################

  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

What is the problem?

How to do image feature extractions?

Hi there!

Is there any documentation or guide on how to use this model as a feature extractor?
I'd like to feed it some images and receive features of those images.

Thanks.
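
A minimal sketch of one possible approach (illustrative, not an official guide): run the backbone's forward_features and read out the class token and patch tokens. The transform below roughly follows the repository's classification eval transform, and 'image.jpg' is a placeholder path.

import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

transform = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

img = transform(Image.open('image.jpg').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    feats = model.forward_features(img)

cls_token = feats["x_norm_clstoken"]        # (1, 384): one global feature per image
patch_tokens = feats["x_norm_patchtokens"]  # (1, 256, 384): one feature per 14x14 patch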

Update dependencies version to >= instead of strict equal

Hi,
I'm trying to install Dinov2 from poetry.
Currently the version checking fails because torchvision is at 0.15.1 and you put 0.15.0 as a dependency.
Could you please update the dependencies so that it works with future versions?

Thank you

poetry add git+https://github.com/facebookresearch/dinov2.git

Updating dependencies
Resolving dependencies... (0.0s)

Because dinov2 (0.0.1) @ git+https://github.com/facebookresearch/dinov2.git@HEAD depends on torchvision (0.15.0)
 and myproject depends on torchvision (^0.15.1), dinov2 is forbidden

Extend to input a sequence of frames

Hi,
I was wondering if there is an easy way to incorporate a sequence of frames while being able to use the pretrained weights (i.e., w/o retraining the network). I do not aim to have aggregated embeddings of multiple frames by passing each frame independently. I am looking to have native video-level inputs.

Thanks in advance!

Export to ONNX supported?

I'm trying to export the model into .onnx.

Here's my code:

import torch
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').to('cuda:0')
model.eval()

# Generate some input data
input_data = torch.randn(1, 3, 224, 224).to('cuda:0')

# Pass the input data through the model
output = model(input_data)

torch.onnx.export(model, input_data, 'model.onnx')

I got an error

============= Diagnostic Run torch.onnx.export version 2.0.0+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
[<ipython-input-17-2f8453f4374c>](https://localhost:8080/#) in <cell line: 1>()
----> 1 torch.onnx.export(model, input_data, 'model.onnx')

13 frames
[~/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/models/vision_transformer.py](https://localhost:8080/#) in prepare_tokens_with_masks(self, x, masks)
    193         x = self.patch_embed(x)
    194         if masks is not None:
--> 195             x = torch.where(masks.unsqueeze(-1), self.mask_token.to(x.dtype).unsqueeze(0), x)
    196 
    197         x = torch.cat((self.cls_token.expand(x.shape[0], -1, -1), x), dim=1)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Did I miss anything?

How to finetune on downstream depth estimation task?

Thanks for sharing your work! I am trying to incorporate your encoder into a depth estimation task, but I haven't found any relevant code.
The paper said,

 DPT: we use the DPT decoder (Ranftl et al., 2021) on top of our frozen models and setup a regression task. We scale the size of the head following the dimension of the features for each architecture. We show results for all baselines, all datasets and all setups in Table 11

Can you share a simple example to guide me on how to use your encoder for depth estimation? Thanks again.
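
A minimal illustrative setup (not the paper's exact protocol; the class and variable names below are made up for the example): freeze the backbone, regress one depth value per patch token with a linear layer, and upsample bilinearly. The head can then be trained with, e.g., an L1 loss against ground-truth depth.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearDepthProbe(nn.Module):
    """Illustrative linear depth head on top of a frozen DINOv2 backbone."""

    def __init__(self, backbone, embed_dim=384, patch_size=14):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep the encoder frozen
        self.patch_size = patch_size
        self.head = nn.Linear(embed_dim, 1)  # one depth value per patch

    def forward(self, x):
        h, w = x.shape[-2] // self.patch_size, x.shape[-1] // self.patch_size
        with torch.no_grad():
            tokens = self.backbone.forward_features(x)["x_norm_patchtokens"]  # (B, h*w, C)
        depth = self.head(tokens)                         # (B, h*w, 1)
        depth = depth.permute(0, 2, 1).reshape(-1, 1, h, w)
        return F.interpolate(depth, size=x.shape[-2:], mode="bilinear", align_corners=False)

backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
probe = LinearDepthProbe(backbone)
pred = probe(torch.rand(2, 3, 224, 224))  # (2, 1, 224, 224) depth map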

How to use DINOv2 pretrained ViT model for Downstream Task ?

Thanks for your great work and it impresses me !
I wanna to have a try in my research.
Specificly, I wanna to use ViT-small to replace the ImageNet pretrained backbone for monocular 3D object detection task. The parameters of. these two networks are comparable and I thought the performance of DINOv2 pretrained ViT-small would be higher.
However, the result shows that the performance of DINOv2 pretrained ViT-small is 20% lower, and the loss is hard to converge.
Since I have fine-tuned learning rate , whatelse can I do to make the ViT backbone avaible ?

Fine-tuning to a downstream task

Your work is awesome. I am migrating ViT-small to my task as a feature extractor; is the code below the right way to load the pretrained weights? It did not succeed in my task, but the original DINO worked well.

# dinov2 model (vit_small and interpolate_pos_embed are assumed to be imported/defined by the user)
model = vit_small(img_size=224, block_chunks=0, mask=None, drop_path_rate=0.1, patch_size=14)
state_dict = torch.hub.load_state_dict_from_url('https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_pretrain.pth')
interpolate_pos_embed(model, state_dict)  # position embedding adaptation

msg = model.load_state_dict(state_dict, strict=False)

print('missing keys: ', msg.missing_keys)
print('unexpected keys: ', msg.unexpected_keys)

Will OCR be possible?

I want to use this as a general-purpose backbone for multiple downstream tasks; does it already perform well on OCR and text recognition tasks?

Dataset

Hi there 👋

Will the dataset ever be shared?

Thanks a lot

Fra

Confused of linear segmentation

Your work is impressive and thanks for your code release.

I have a question about linear semantic segmentation. In your paper, an upsampling operation follows the linear layer; is it just an interpolation like F.interpolate(), as also mentioned in #25? If so, from my understanding, it is an interpolation of the class probabilities computed by the linear layer; is that right?
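
For reference, a minimal sketch of that reading (illustrative only, with an untrained head; whether it matches the released heads exactly should be checked against the repository's segmentation code):

import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
backbone.eval()

num_classes = 150                          # e.g. ADE20K
linear_head = nn.Linear(384, num_classes)  # 384 = ViT-S/14 embedding dim

x = torch.rand(1, 3, 518, 518)             # 518 = 37 x 14
with torch.no_grad():
    tokens = backbone.forward_features(x)["x_norm_patchtokens"]  # (1, 37*37, 384)

h = w = x.shape[-1] // 14
logits = linear_head(tokens)                                     # (1, 37*37, num_classes)
logits = logits.permute(0, 2, 1).reshape(1, num_classes, h, w)   # low-resolution logit map
logits = F.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)
seg = logits.argmax(dim=1)                                       # (1, 518, 518) class map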

Can cuda11.6 run torch==2.0.0+cu117 properly?

The environment required to install this project is cu11.7, but it seems that my 3060 graphics card can only support up to cu11.6. Does this mean I cannot use torch==2.0.0+cu117?

Details of patch matching

Thanks so much for this inspiring and excellent work!
I implemented the patch matching, but it did not perform as well as the demo in the paper. Could you share more details of the patch matching or release the code?

How to do Instance Retrieval like the demo?

Hello, I have recently come across the demo on instance retrieval in this repository, and I'm very interested in implementing a similar feature in my own project. However, I'm having some difficulties understanding the exact steps and required components to achieve this.

Can you please provide some guidance or documentation on how to replicate the instance retrieval functionality as demonstrated in the demo? Specifically, I'd like to know about the following:

  1. Which part of the codebase is responsible for instance retrieval?
  2. Any examples or tutorials that could help me better understand the implementation?

I appreciate any help or direction you can provide. Thank you!
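
A rough baseline sketch (not necessarily how the demo is implemented; the embed helper and random tensors below are just for illustration): embed every image with the frozen backbone and rank the database by cosine similarity to the query embedding.

import torch
import torch.nn.functional as F

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

def embed(batch):
    # batch: (N, 3, H, W) preprocessed images (H and W multiples of 14, ImageNet-normalized)
    with torch.no_grad():
        return F.normalize(model(batch), dim=-1)  # L2-normalized global embeddings

database = embed(torch.rand(8, 3, 224, 224))  # placeholder database images
query = embed(torch.rand(1, 3, 224, 224))     # placeholder query image

scores = query @ database.T                   # cosine similarities, shape (1, 8)
ranking = scores.argsort(dim=-1, descending=True)
print(ranking[0])                             # database indices, most similar first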

Question about instance recognition metrics

Hi,

In Table 9 (Evaluation of frozen features on instance-level recognition), the performance for OpenCLIP-G/14 is reported as 50.7 for Oxford-M and 19.7 for Oxford-H. However, we only get 39.4 for Oxford-M and 11.7 for Oxford-H (even without the 1M distractors) using the evaluation code https://github.com/filipradenovic/revisitop/blob/master/python/evaluate.py#L39

We also tried revisited Oxford (without the 1M distractors) with the DINOv2-B/14 distilled backbone and the make_classification_eval_transform() transform from this repo; the metrics we get are 0.58 for Oxford-M and 0.337 for Oxford-H, which seem much lower than the numbers reported in the paper (0.729 for Oxford-M and 0.495 for Oxford-H).

If possible, could you help clarify:

  1. Which metric are you reporting in the paper: mean average precision or mean precision at the kappas?
  2. Are you including the 1M distractors in the eval?
  3. Which transform should we use with the released backbone?

Similarly for Met, we also cannot reproduce the eval metrics for either OpenCLIP-G/14 or DINOv2-B/14.

It would be great if you could provide the code to run on the eval sets or the generated embeddings!

Thanks!

potential type mismatch in eval mode?

1.) vision_transformer forward does self.head(ret["x_norm_clstoken"]) in eval, where ret is the result of forward_features

def forward(self, *args, is_training=False, **kwargs):

2.) the value of forward_features can be the result of forward_features_list

return self.forward_features_list(x, masks)

3.) forward_features_list returns a list of dictionaries

So, ret["x_norm_clstoken"] cannot work since ret is a list of outputs, not an individual output.

How to get intermediate image features like from Swin Transformers?

When I pass a 1024px image and get intermediate image features from a Swin Transformer, I get back feature maps of the following sizes:

torch.Size([1, 128, 256, 256])
torch.Size([1, 128, 256, 256])
torch.Size([1, 256, 128, 128])
torch.Size([1, 512, 64, 64])
torch.Size([1, 1024, 32, 32])

How do I get something like this from dinov2?
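
DINOv2 ViTs are not hierarchical like Swin, so every level shares the same 1/14 resolution, but get_intermediate_layers can return several blocks as spatial feature maps. A sketch (the reshape argument below should be double-checked against the installed dinov2 version):

import torch

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

x = torch.rand(1, 3, 1022, 1022)  # 1022 = 73 x 14 (sides should be multiples of the patch size)
with torch.no_grad():
    feats = model.get_intermediate_layers(x, n=4, reshape=True)

for f in feats:
    print(f.shape)  # torch.Size([1, 384, 73, 73]) for each of the 4 returned blocks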

High resolution image result with NaN features

Hello,

I'm having an issue with Dinov2 while trying to use it with high-resolution images like the one available at this link. The problem is that the features returned by the model contain NaN values. This issue occurs with all four available models and is consistently present for images around the same size.

I would like to know if you have any ideas about what could be causing this problem. Here's a minimal example:

import torch
import numpy as np
import torchvision.transforms as T
from PIL import Image
import hubconf

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dino = hubconf.dinov2_vits14().to(device)  # Same issue with larger model
img = Image.open('4k.png')
pw, ph = np.array(img.size) // 14

transform = T.Compose([
    T.Resize((14 * ph, 14 * pw), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

tensor = transform(img)[:3].unsqueeze(0).to(device)
with torch.no_grad():
    features = dino.forward_features(tensor)['x_norm_patchtokens'][0]

print(features)  # NaN

ConvNext

Hello,

Have you considered using the ConvNeXt architecture for training DINOv2?

ConvNeXt has been shown to have improved performance and lower latency on tasks such as CLIP. For example, in the open_clip repository, ConvNeXt-L@320 achieves better results, with a +1.4% increase in zero-shot accuracy, and is more than twice as fast as ViT-L@336.

While ViT may be easier to use in a "tokenize everything for my transformer world" approach, it's worth considering that CNNs still deserve... attention ^^

Best,
Simon

Feature transformation before PCA

Thank you for sharing Figure 1 from the paper, which showcases the mapping of features to RGB channels using PCA. I found it to be really impressive! I was wondering if I could ask a question about the details of the PCA process. Specifically, I was curious to know if the features were normalized, scaled, or translated before applying PCA. If they were, could you kindly provide me with more information on how the normalization, scaling, or translation was carried out? For instance, I am curious about the axis along which normalization or scaling was performed, and whether the normalization or scaling factors were computed based on individual images or the entire training dataset. Thank you very much in advance for your help!
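
For reference, a common recipe for this kind of visualization (a sketch only, not necessarily the exact preprocessing used for the paper's figure): stack the patch tokens of all frames, project them onto the first three principal components, and min-max scale each component to [0, 1] before mapping to RGB.

import torch

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

imgs = torch.rand(4, 3, 224, 224)  # placeholder batch of preprocessed frames
with torch.no_grad():
    tokens = model.forward_features(imgs)["x_norm_patchtokens"]  # (4, 256, 384)

flat = tokens.reshape(-1, tokens.shape[-1])             # (4*256, 384)
_, _, v = torch.pca_lowrank(flat, q=3)                  # top-3 principal directions, (384, 3)
proj = flat @ v                                         # (4*256, 3)
proj = (proj - proj.min(0).values) / (proj.max(0).values - proj.min(0).values + 1e-8)
rgb = proj.reshape(4, 16, 16, 3)                        # one RGB value per 14x14 patch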

ResolvePackageNotFound: xformers::xformers=0.0.18

Can somebody please explain this? (Windows 10 + Anaconda)

Requirement already satisfied: xformers in c:\users\atc\appdata\roaming\python\python39\site-packages (0.0.18)

versus

ResolvePackageNotFound:
  - xformers::xformers=0.0.18

The full log:

(base) E:\dino-v2>conda env create -f conda.yaml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:
  - xformers::xformers=0.0.18


(base) E:\dino-v2>pip install xformers
Requirement already satisfied: xformers in c:\users\atc\appdata\roaming\python\python39\site-packages (0.0.18)
Requirement already satisfied: pyre-extensions==0.0.23 in c:\users\atc\appdata\roaming\python\python39\site-packages (from xformers) (0.0.23)
Requirement already satisfied: torch==2.0.0 in c:\users\atc\appdata\roaming\python\python39\site-packages (from xformers) (2.0.0)
Requirement already satisfied: numpy in c:\programdata\anaconda3\lib\site-packages (from xformers) (1.21.5)
Requirement already satisfied: typing-inspect in c:\users\atc\appdata\roaming\python\python39\site-packages (from pyre-extensions==0.0.23->xformers) (0.8.0)
Requirement already satisfied: typing-extensions in c:\programdata\anaconda3\lib\site-packages (from pyre-extensions==0.0.23->xformers) (4.3.0)
Requirement already satisfied: sympy in c:\programdata\anaconda3\lib\site-packages (from torch==2.0.0->xformers) (1.10.1)
Requirement already satisfied: networkx in c:\programdata\anaconda3\lib\site-packages (from torch==2.0.0->xformers) (2.8.4)
Requirement already satisfied: jinja2 in c:\programdata\anaconda3\lib\site-packages (from torch==2.0.0->xformers) (2.11.3)
Requirement already satisfied: filelock in c:\programdata\anaconda3\lib\site-packages (from torch==2.0.0->xformers) (3.6.0)
Requirement already satisfied: MarkupSafe>=0.23 in c:\programdata\anaconda3\lib\site-packages (from jinja2->torch==2.0.0->xformers) (2.0.1)
Requirement already satisfied: mpmath>=0.19 in c:\programdata\anaconda3\lib\site-packages (from sympy->torch==2.0.0->xformers) (1.2.1)
Requirement already satisfied: mypy-extensions>=0.3.0 in c:\programdata\anaconda3\lib\site-packages (from typing-inspect->pyre-extensions==0.0.23->xformers) (0.4.3)

(base) E:\dino-v2>conda env create -f conda.yaml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:
  - xformers::xformers=0.0.18


(base) E:\dino-v2>pip install --pre -U xformers
Requirement already satisfied: xformers in c:\users\atc\appdata\roaming\python\python39\site-packages (0.0.18)
Collecting xformers
  Downloading xformers-0.0.19.dev516-cp39-cp39-win_amd64.whl (112.2 MB)
     ---------------------------------------- 112.2/112.2 MB 767.9 kB/s eta 0:00:00
Collecting pyre-extensions==0.0.29
  Downloading pyre_extensions-0.0.29-py3-none-any.whl (12 kB)
Requirement already satisfied: torch==2.0.0 in c:\users\atc\appdata\roaming\python\python39\site-packages (from xformers) (2.0.0)
Requirement already satisfied: numpy in c:\programdata\anaconda3\lib\site-packages (from xformers) (1.21.5)
Requirement already satisfied: typing-extensions in c:\programdata\anaconda3\lib\site-packages (from pyre-extensions==0.0.29->xformers) (4.3.0)
Requirement already satisfied: typing-inspect in c:\users\atc\appdata\roaming\python\python39\site-packages (from pyre-extensions==0.0.29->xformers) (0.8.0)
Requirement already satisfied: filelock in c:\programdata\anaconda3\lib\site-packages (from torch==2.0.0->xformers) (3.6.0)
Requirement already satisfied: networkx in c:\programdata\anaconda3\lib\site-packages (from torch==2.0.0->xformers) (2.8.4)
Requirement already satisfied: jinja2 in c:\programdata\anaconda3\lib\site-packages (from torch==2.0.0->xformers) (2.11.3)
Requirement already satisfied: sympy in c:\programdata\anaconda3\lib\site-packages (from torch==2.0.0->xformers) (1.10.1)
Requirement already satisfied: MarkupSafe>=0.23 in c:\programdata\anaconda3\lib\site-packages (from jinja2->torch==2.0.0->xformers) (2.0.1)
Requirement already satisfied: mpmath>=0.19 in c:\programdata\anaconda3\lib\site-packages (from sympy->torch==2.0.0->xformers) (1.2.1)
Requirement already satisfied: mypy-extensions>=0.3.0 in c:\programdata\anaconda3\lib\site-packages (from typing-inspect->pyre-extensions==0.0.29->xformers) (0.4.3)
Installing collected packages: pyre-extensions, xformers
  Attempting uninstall: pyre-extensions
    Found existing installation: pyre-extensions 0.0.23
    Uninstalling pyre-extensions-0.0.23:
      Successfully uninstalled pyre-extensions-0.0.23
  Attempting uninstall: xformers
    Found existing installation: xformers 0.0.18
    Uninstalling xformers-0.0.18:
      Successfully uninstalled xformers-0.0.18
Successfully installed pyre-extensions-0.0.29 xformers-0.0.19.dev516

(base) E:\dino-v2>conda env create -f conda.yaml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:
  - xformers::xformers=0.0.18


Extend to 3D images

I would like to ask if there is a possibility to modify the code to feed 3D images. If not, do you have a plan to extend the code to 3D images?
Thanks,
Nima

Having trouble reproducing exact numbers for evaluation.

Hi, I tried to reproduce the evaluation numbers in Table 4 and Table 6 of the paper.
I downloaded the backbones and linear classifiers from the readme and composed the classifier like this:

import torch
from functools import partial
from torch import nn

# Assumed import locations for the dinov2 eval helpers used below
from dinov2.eval.linear import LinearClassifier, create_linear_input
from dinov2.eval.utils import ModelWithIntermediateLayers

class Dino(nn.Module):
    def __init__(self, type="dinov2_vits14", pretrained=False):
        super().__init__()
        # get feature model
        model = torch.hub.load(
            "facebookresearch/dinov2", type, pretrained=pretrained
        ).cuda()
        autocast_ctx = partial(
            torch.cuda.amp.autocast, enabled=True, dtype=torch.float16
        )
        self.feature_model = ModelWithIntermediateLayers(
            model, n_last_blocks=1, autocast_ctx=autocast_ctx
        ).cuda()
        sample_input = torch.randn(1, 3, 224, 224).cuda()
        sample_output = self.feature_model(sample_input)

        # get linear readout
        out_dim = create_linear_input(
            sample_output, use_n_blocks=1, use_avgpool=True
        ).shape[1]
        self.classifier = LinearClassifier(
            out_dim, use_n_blocks=1, use_avgpool=True
        ).cuda()
        if pretrained:
            vits_linear = torch.load(f"/pretrained_checkpoints/{type}_linear_head.pth")
            self.classifier.linear.load_state_dict(vits_linear)

    def forward(self, x):
        x = self.feature_model(x)
        x = self.classifier(x)
        return x

Unfortunately, I did not get the same results that you report in the paper.
The results I get for ViT-g/14 (ViT-S/14) are:

Val: 85.79 (80.44)
Real: 89.23 (86.14)
v2: 77.33 (69.76)
Inet-C: 23.38 (55.38)
Inet-A: 76.81 (33.85)
Inet-R: 79.68 (53.04)
Inet-Sketch: 62.47 (39.62) 

Maybe you can give me a hint as to where I'm doing something wrong?

