
PViC: Predicate Visual Context

This repository contains the official PyTorch implementation for the paper

Frederic Z. Zhang, Yuhui Yuan, Dylan Campbell, Zhuoyao Zhong, Stephen Gould; Exploring Predicate Visual Context for Detecting Human-Object Interactions; In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 10411-10421.

[preprint] [paper] [video]

Abstract

Recently, the DETR framework has emerged as the dominant approach for human–object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost.

Prerequisites

  1. Use the package management tool of your choice and run the following commands after creating your environment.
    # Say you are using Conda
    conda create --name pvic python=3.8
    conda activate pvic
    # Required dependencies
    pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
    pip install matplotlib==3.6.3 scipy==1.10.0 tqdm==4.64.1
    pip install numpy==1.24.1 timm==0.6.12
    pip install wandb==0.13.9 seaborn==0.13.0
    # Clone the repo and submodules
    git clone https://github.com/fredzzhang/pvic.git
    cd pvic
    git submodule init
    git submodule update
    pip install -e pocket
    # Build CUDA operator for MultiScaleDeformableAttention
    cd h_detr/models/ops
    python setup.py build install
  2. Prepare the HICO-DET dataset.
    1. If you have not downloaded the dataset before, run the following script.
      cd /path/to/pvic/hicodet
      bash download.sh
    2. If you have previously downloaded the dataset, simply create a soft link.
      cd /path/to/pvic/hicodet
      ln -s /path/to/hicodet_20160224_det ./hico_20160224_det
  3. Prepare the V-COCO dataset (its images are drawn from MS COCO).
    1. If you have not downloaded the dataset before, run the following script.
      cd /path/to/pvic/vcoco
      bash download.sh
    2. If you have previously downloaded the dataset, simply create a soft link.
      cd /path/to/pvic/vcoco
      ln -s /path/to/coco ./mscoco2014
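
Not part of the original instructions, but a quick sanity check (a hypothetical helper, assuming the directory names used above) can confirm both datasets are in place before training:

```shell
# Hypothetical sanity check: confirm the dataset directories the repo
# expects (per the steps above) actually exist.
check_dir() {
  if [ -d "$1" ]; then
    echo "OK: $1"
  else
    echo "MISSING: $1"
  fi
}

check_dir ./hicodet/hico_20160224_det
check_dir ./vcoco/mscoco2014
```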

Inference

Visualisation utilities are implemented to run inference on a single image and visualise the cross-attention weights. A reference model is provided for demonstration purposes if you do not want to train a model yourself. Download the model and save it to ./checkpoints/. Use the argument --index to select images and --action to specify the action index. Refer to the lookup table for action indices.

DETR=base python inference.py --resume checkpoints/pvic-detr-r50-hicodet.pth --index 4050 --action 111

The detected human-object pairs with scores overlaid are saved to fig.png, while the attention weights are saved to pair_xx_attn_head_x.png. Below are some sample outputs.

[Sample outputs: detected pairs with overlaid scores, and per-head cross-attention maps]

In addition, the argument --image-path enables inference on custom images.

Training and Testing

Refer to the documentation for model checkpoints and training/testing commands.

License

PViC is released under the BSD-3-Clause License.

Citation

If you find our work useful for your research, please consider citing us:

@inproceedings{zhang2023pvic,
  author    = {Zhang, Frederic Z. and Yuan, Yuhui and Campbell, Dylan and Zhong, Zhuoyao and Gould, Stephen},
  title     = {Exploring Predicate Visual Context for Detecting Human–Object Interactions},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2023},
  pages     = {10411-10421},
}

@inproceedings{zhang2022upt,
  author    = {Zhang, Frederic Z. and Campbell, Dylan and Gould, Stephen},
  title     = {Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2022},
  pages     = {20104-20112}
}

@inproceedings{zhang2021scg,
  author    = {Zhang, Frederic Z. and Campbell, Dylan and Gould, Stephen},
  title     = {Spatially Conditioned Graphs for Detecting Human–Object Interactions},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2021},
  pages     = {13319-13327}
}


pvic's Issues

Training ERROR

Hello, thank you for your work. I encountered the following issue during training and would greatly appreciate your help in solving it.
WARNING: Collected results are empty. Return zero AP for class 597.
WARNING: Collected results are empty. Return zero AP for class 598.
WARNING: Collected results are empty. Return zero AP for class 599.
Epoch 0 => mAP: 0.0000, rare: 0.0000, none-rare: 0.0000.
Traceback (most recent call last):
File "main.py", line 192, in <module>
mp.spawn(main, nprocs=args.world_size, args=(args,))
File "/data1/hujiajun/software/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/data1/hujiajun/software/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/data1/hujiajun/software/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/data1/hujiajun/software/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/data1/hujiajun/workstation/pvic/main.py", line 127, in main
engine(args.epochs)
File "/data1/hujiajun/workstation/pvic/pocket/pocket/core/distributed.py", line 139, in __call__
self._on_each_iteration()
File "/data1/hujiajun/workstation/pvic/utils.py", line 195, in _on_each_iteration
raise ValueError(f"The HOI loss is NaN for rank {self._rank}")
ValueError: The HOI loss is NaN for rank 2
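
For context, the check that raises this error can be sketched in plain Python (a simplified stand-in: the repo tests a torch scalar, and `guard_loss` is a hypothetical name):

```python
import math

def guard_loss(loss_value: float, rank: int = 0) -> float:
    """Fail fast when the loss turns NaN, mirroring the check in utils.py.

    A NaN loss usually points at an upstream issue (e.g. too high a
    learning rate or degenerate boxes), so stopping early is preferable
    to silently corrupting the weights.
    """
    if math.isnan(loss_value):
        raise ValueError(f"The HOI loss is NaN for rank {rank}")
    return loss_value
```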

Zero-shot inference for unseen verbs

Hi,
Thank you for your excellent work. We would like to try zero-shot inference. In the unseen-verb (UV) setting, all unseen action labels are removed during training.

Suppose we use the 117 verb texts from the CLIP text encoder, with dimensions [117, 512]. During training, the valid object-action list excludes interactions with unseen actions. So does training use only the 97 seen verb texts, or still all 117?

We assume that since unseen actions (UVs) never appear in training, they behave like invalid object-action combinations: backpropagation does not update them (their gradient is 0). Is our understanding correct?

    def compute_classification_loss(self, logits, prior, labels):
        prior = torch.cat(prior, dim=0).prod(1)
        x, y = torch.nonzero(prior).unbind(1)
        # In backpropagation, logits for invalid object-verb pairs are never
        # selected, so their gradients are zero and the corresponding
        # parameters are not updated.
        logits = logits[:, x, y]
        prior = prior[x, y]
        labels = labels[None, x, y].repeat(len(logits), 1)
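
The zero-gradient argument can be checked with a torch-free sketch of the index selection (`valid_indices` is a hypothetical helper, standing in for `torch.nonzero(prior).unbind(1)`):

```python
def valid_indices(prior):
    """Return the (x, y) coordinates of non-zero entries in a 2-D prior.

    Logits at zero-prior positions are never selected by logits[:, x, y],
    so they contribute nothing to the loss and receive zero gradient;
    parameters tied only to invalid (or unseen) object-verb combinations
    are therefore effectively frozen during training.
    """
    xs, ys = [], []
    for i, row in enumerate(prior):
        for j, value in enumerate(row):
            if value != 0:
                xs.append(i)
                ys.append(j)
    return xs, ys
```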

The problem in your code???

I noticed that your ICCV 2023 paper mentions layer normalisation is applied before concatenating spatial and content features, to avoid numerical overflow and ensure stable training. However, I could not find this part in the code. Could that be why I keep getting 'HOI loss is NaN'?
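
For reference, the normalisation itself is simple; a pure-Python sketch (omitting the learnable affine parameters of `nn.LayerNorm`) shows the zero-mean, unit-variance rescaling that keeps the concatenated features numerically stable:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a feature vector to zero mean and (near) unit variance,
    as nn.LayerNorm does without its learnable scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]
```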

[Bug] Cannot load the parameters of the advanced model (SwinL-based) on HICO-DET

Running main.py with the following args:

            "args": [
                "--backbone", "swin_large",
                "--drop-path-rate", "0.5", 
                "--num-queries-one2one", "900",
                "--num-queries-one2many", "1500",
                "--world-size", "1",
                "--batch-size", "1",
                "--eval", 
                "--resume", "./checkpoints/h-defm-detr-swinL-dp0-mqs-lft-iter-2stg-hicodet.pth"
            ],

execution reaches

model.load_state_dict(checkpoint['model_state_dict'])

and the error appears as follows (in two parts: missing keys and unexpected keys).

It seems that the model has been updated since the parameters were released. Could you please provide the latest version that aligns with the current code? Thank you very much!
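
A small helper (hypothetical, but reproducing what `load_state_dict` reports) makes it easy to diff the checkpoint against the current model before deciding whether a mismatch is benign:

```python
def diff_state_dicts(model_keys, ckpt_keys):
    """Return (missing, unexpected) keys, as load_state_dict reports them.

    missing: keys the model expects but the checkpoint lacks;
    unexpected: keys in the checkpoint the model does not define.
    """
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    missing = sorted(model_keys - ckpt_keys)
    unexpected = sorted(ckpt_keys - model_keys)
    return missing, unexpected
```

Note that `load_state_dict(..., strict=False)` would load only the intersection of the two key sets, but with this many missing detector keys the released weights and the current model definition genuinely differ, so a re-exported checkpoint is the proper fix.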

RuntimeError: Error(s) in loading state_dict for PViC:

Missing key(s) in state_dict:
"detector.transformer.level_embed", "detector.transformer.encoder.layers.0.self_attn.sampling_offsets.weight", "detector.transformer.encoder.layers.0.self_attn.sampling_offsets.bias", ...
(several hundred further "detector.*" keys follow, covering every encoder/decoder layer, the bbox_embed and class_embed heads, and the Swin backbone; the log is truncated in the original issue)
"detector.backbone.0.body.layers.2.blocks.3.attn.qkv.bias", "detector.backbone.0.body.layers.2.blocks.3.attn.proj.weight", "detector.backbone.0.body.layers.2.blocks.3.attn.proj.bias", "detector.backbone.0.body.layers.2.blocks.3.norm2.weight", "detector.backbone.0.body.layers.2.blocks.3.norm2.bias", "detector.backbone.0.body.layers.2.blocks.3.mlp.fc1.weight", "detector.backbone.0.body.layers.2.blocks.3.mlp.fc1.bias", "detector.backbone.0.body.layers.2.blocks.3.mlp.fc2.weight", "detector.backbone.0.body.layers.2.blocks.3.mlp.fc2.bias", "detector.backbone.0.body.layers.2.blocks.4.norm1.weight", "detector.backbone.0.body.layers.2.blocks.4.norm1.bias", "detector.backbone.0.body.layers.2.blocks.4.attn.relative_position_bias_table", "detector.backbone.0.body.layers.2.blocks.4.attn.relative_position_index", "detector.backbone.0.body.layers.2.blocks.4.attn.qkv.weight",

(...)

Unexpected key(s) in state_dict:
"transformer.level_embed", "transformer.encoder.layers.0.self_attn.sampling_offsets.weight", "transformer.encoder.layers.0.self_attn.sampling_offsets.bias", "transformer.encoder.layers.0.self_attn.attention_weights.weight", "transformer.encoder.layers.0.self_attn.attention_weights.bias", "transformer.encoder.layers.0.self_attn.value_proj.weight", "transformer.encoder.layers.0.self_attn.value_proj.bias", "transformer.encoder.layers.0.self_attn.output_proj.weight", "transformer.encoder.layers.0.self_attn.output_proj.bias", "transformer.encoder.layers.0.norm1.weight", "transformer.encoder.layers.0.norm1.bias", "transformer.encoder.layers.0.linear1.weight", "transformer.encoder.layers.0.linear1.bias", "transformer.encoder.layers.0.linear2.weight", "transformer.encoder.layers.0.linear2.bias", "transformer.encoder.layers.0.norm2.weight", "transformer.encoder.layers.0.norm2.bias", "transformer.encoder.layers.1.self_attn.sampling_offsets.weight", "transformer.encoder.layers.1.self_attn.sampling_offsets.bias", "transformer.encoder.layers.1.self_attn.attention_weights.weight", "transformer.encoder.layers.1.self_attn.attention_weights.bias", "transformer.encoder.layers.1.self_attn.value_proj.weight", "transformer.encoder.layers.1.self_attn.value_proj.bias", "transformer.encoder.layers.1.self_attn.output_proj.weight", "transformer.encoder.layers.1.self_attn.output_proj.bias", "transformer.encoder.layers.1.norm1.weight", "transformer.encoder.layers.1.norm1.bias", "transformer.encoder.layers.1.linear1.weight", "transformer.encoder.layers.1.linear1.bias", "transformer.encoder.layers.1.linear2.weight", "transformer.encoder.layers.1.linear2.bias", "transformer.encoder.layers.1.norm2.weight", "transformer.encoder.layers.1.norm2.bias", "transformer.encoder.layers.2.self_attn.sampling_offsets.weight", "transformer.encoder.layers.2.self_attn.sampling_offsets.bias", "transformer.encoder.layers.2.self_attn.attention_weights.weight", 
"transformer.encoder.layers.2.self_attn.attention_weights.bias", "transformer.encoder.layers.2.self_attn.value_proj.weight", "transformer.encoder.layers.2.self_attn.value_proj.bias", "transformer.encoder.layers.2.self_attn.output_proj.weight", "transformer.encoder.layers.2.self_attn.output_proj.bias", "transformer.encoder.layers.2.norm1.weight", "transformer.encoder.layers.2.norm1.bias", "transformer.encoder.layers.2.linear1.weight", "transformer.encoder.layers.2.linear1.bias", "transformer.encoder.layers.2.linear2.weight", "transformer.encoder.layers.2.linear2.bias", "transformer.encoder.layers.2.norm2.weight", "transformer.encoder.layers.2.norm2.bias", "transformer.encoder.layers.3.self_attn.sampling_offsets.weight", "transformer.encoder.layers.3.self_attn.sampling_offsets.bias", "transformer.encoder.layers.3.self_attn.attention_weights.weight", "transformer.encoder.layers.3.self_attn.attention_weights.bias", "transformer.encoder.layers.3.self_attn.value_proj.weight", "transformer.encoder.layers.3.self_attn.value_proj.bias", "transformer.encoder.layers.3.self_attn.output_proj.weight", "transformer.encoder.layers.3.self_attn.output_proj.bias", "transformer.encoder.layers.3.norm1.weight", "transformer.encoder.layers.3.norm1.bias", "transformer.encoder.layers.3.linear1.weight", "transformer.encoder.layers.3.linear1.bias", "transformer.encoder.layers.3.linear2.weight", "transformer.encoder.layers.3.linear2.bias", "transformer.encoder.layers.3.norm2.weight", "transformer.encoder.layers.3.norm2.bias", "transformer.encoder.layers.4.self_attn.sampling_offsets.weight", "transformer.encoder.layers.4.self_attn.sampling_offsets.bias", "transformer.encoder.layers.4.self_attn.attention_weights.weight", "transformer.encoder.layers.4.self_attn.attention_weights.bias", "transformer.encoder.layers.4.self_attn.value_proj.weight", "transformer.encoder.layers.4.self_attn.value_proj.bias", "transformer.encoder.layers.4.self_attn.output_proj.weight", 
"transformer.encoder.layers.4.self_attn.output_proj.bias", "transformer.encoder.layers.4.norm1.weight", "transformer.encoder.layers.4.norm1.bias", "transformer.encoder.layers.4.linear1.weight", "transformer.encoder.layers.4.linear1.bias", "transformer.encoder.layers.4.linear2.weight", "transformer.encoder.layers.4.linear2.bias", "transformer.encoder.layers.4.norm2.weight", "transformer.encoder.layers.4.norm2.bias", "transformer.encoder.layers.5.self_attn.sampling_offsets.weight", "transformer.encoder.layers.5.self_attn.sampling_offsets.bias", "transformer.encoder.layers.5.self_attn.attention_weights.weight", "transformer.encoder.layers.5.self_attn.attention_weights.bias", "transformer.encoder.layers.5.self_attn.value_proj.weight", "transformer.encoder.layers.5.self_attn.value_proj.bias", "transformer.encoder.layers.5.self_attn.output_proj.weight", "transformer.encoder.layers.5.self_attn.output_proj.bias", "transformer.encoder.layers.5.norm1.weight", "transformer.encoder.layers.5.norm1.bias", "transformer.encoder.layers.5.linear1.weight", "transformer.encoder.layers.5.linear1.bias", "transformer.encoder.layers.5.linear2.weight", "transformer.encoder.layers.5.linear2.bias", "transformer.encoder.layers.5.norm2.weight", "transformer.encoder.layers.5.norm2.bias", "transformer.decoder.layers.0.cross_attn.sampling_offsets.weight", "transformer.decoder.layers.0.cross_attn.sampling_offsets.bias", "transformer.decoder.layers.0.cross_attn.attention_weights.weight", "transformer.decoder.layers.0.cross_attn.attention_weights.bias", "transformer.decoder.layers.0.cross_attn.value_proj.weight", "transformer.decoder.layers.0.cross_attn.value_proj.bias", "transformer.decoder.layers.0.cross_attn.output_proj.weight", "transformer.decoder.layers.0.cross_attn.output_proj.bias", "transformer.decoder.layers.0.norm1.weight", "transformer.decoder.layers.0.norm1.bias", "transformer.decoder.layers.0.self_attn.in_proj_weight", "transformer.decoder.layers.0.self_attn.in_proj_bias", 
"transformer.decoder.layers.0.self_attn.out_proj.weight", "transformer.decoder.layers.0.self_attn.out_proj.bias", "transformer.decoder.layers.0.norm2.weight", "transformer.decoder.layers.0.norm2.bias", "transformer.decoder.layers.0.linear1.weight", "transformer.decoder.layers.0.linear1.bias", "transformer.decoder.layers.0.linear2.weight", "transformer.decoder.layers.0.linear2.bias", "transformer.decoder.layers.0.norm3.weight", "transformer.decoder.layers.0.norm3.bias", "transformer.decoder.layers.1.cross_attn.sampling_offsets.weight", "transformer.decoder.layers.1.cross_attn.sampling_offsets.bias", "transformer.decoder.layers.1.cross_attn.attention_weights.weight", "transformer.decoder.layers.1.cross_attn.attention_weights.bias", "transformer.decoder.layers.1.cross_attn.value_proj.weight", "transformer.decoder.layers.1.cross_attn.value_proj.bias", "transformer.decoder.layers.1.cross_attn.output_proj.weight", "transformer.decoder.layers.1.cross_attn.output_proj.bias", "transformer.decoder.layers.1.norm1.weight", "transformer.decoder.layers.1.norm1.bias", "transformer.decoder.layers.1.self_attn.in_proj_weight", "transformer.decoder.layers.1.self_attn.in_proj_bias", "transformer.decoder.layers.1.self_attn.out_proj.weight", "transformer.decoder.layers.1.self_attn.out_proj.bias", "transformer.decoder.layers.1.norm2.weight", "transformer.decoder.layers.1.norm2.bias", "transformer.decoder.layers.1.linear1.weight", "transformer.decoder.layers.1.linear1.bias", "transformer.decoder.layers.1.linear2.weight", "transformer.decoder.layers.1.linear2.bias", "transformer.decoder.layers.1.norm3.weight", "transformer.decoder.layers.1.norm3.bias", "transformer.decoder.layers.2.cross_attn.sampling_offsets.weight", "transformer.decoder.layers.2.cross_attn.sampling_offsets.bias", "transformer.decoder.layers.2.cross_attn.attention_weights.weight", "transformer.decoder.layers.2.cross_attn.attention_weights.bias", "transformer.decoder.layers.2.cross_attn.value_proj.weight", 
"transformer.decoder.layers.2.cross_attn.value_proj.bias", "transformer.decoder.layers.2.cross_attn.output_proj.weight", "transformer.decoder.layers.2.cross_attn.output_proj.bias", "transformer.decoder.layers.2.norm1.weight", "transformer.decoder.layers.2.norm1.bias", "transformer.decoder.layers.2.self_attn.in_proj_weight", "transformer.decoder.layers.2.self_attn.in_proj_bias", "transformer.decoder.layers.2.self_attn.out_proj.weight", "transformer.decoder.layers.2.self_attn.out_proj.bias", "transformer.decoder.layers.2.norm2.weight", "transformer.decoder.layers.2.norm2.bias", "transformer.decoder.layers.2.linear1.weight", "transformer.decoder.layers.2.linear1.bias", "transformer.decoder.layers.2.linear2.weight", "transformer.decoder.layers.2.linear2.bias", "transformer.decoder.layers.2.norm3.weight", "transformer.decoder.layers.2.norm3.bias", "transformer.decoder.layers.3.cross_attn.sampling_offsets.weight", "transformer.decoder.layers.3.cross_attn.sampling_offsets.bias", "transformer.decoder.layers.3.cross_attn.attention_weights.weight", "transformer.decoder.layers.3.cross_attn.attention_weights.bias", "transformer.decoder.layers.3.cross_attn.value_proj.weight", "transformer.decoder.layers.3.cross_attn.value_proj.bias", "transformer.decoder.layers.3.cross_attn.output_proj.weight", "transformer.decoder.layers.3.cross_attn.output_proj.bias", "transformer.decoder.layers.3.norm1.weight", "transformer.decoder.layers.3.norm1.bias", "transformer.decoder.layers.3.self_attn.in_proj_weight", "transformer.decoder.layers.3.self_attn.in_proj_bias", "transformer.decoder.layers.3.self_attn.out_proj.weight", "transformer.decoder.layers.3.self_attn.out_proj.bias", "transformer.decoder.layers.3.norm2.weight", "transformer.decoder.layers.3.norm2.bias", "transformer.decoder.layers.3.linear1.weight", "transformer.decoder.layers.3.linear1.bias", "transformer.decoder.layers.3.linear2.weight", "transformer.decoder.layers.3.linear2.bias", "transformer.decoder.layers.3.norm3.weight", 
"transformer.decoder.layers.3.norm3.bias", "transformer.decoder.layers.4.cross_attn.sampling_offsets.weight", "transformer.decoder.layers.4.cross_attn.sampling_offsets.bias", "transformer.decoder.layers.4.cross_attn.attention_weights.weight", "transformer.decoder.layers.4.cross_attn.attention_weights.bias", "transformer.decoder.layers.4.cross_attn.value_proj.weight", "transformer.decoder.layers.4.cross_attn.value_proj.bias", "transformer.decoder.layers.4.cross_attn.output_proj.weight", "transformer.decoder.layers.4.cross_attn.output_proj.bias", "transformer.decoder.layers.4.norm1.weight", "transformer.decoder.layers.4.norm1.bias", "transformer.decoder.layers.4.self_attn.in_proj_weight", "transformer.decoder.layers.4.self_attn.in_proj_bias", "transformer.decoder.layers.4.self_attn.out_proj.weight", "transformer.decoder.layers.4.self_attn.out_proj.bias", "transformer.decoder.layers.4.norm2.weight", "transformer.decoder.layers.4.norm2.bias", "transformer.decoder.layers.4.linear1.weight", "transformer.decoder.layers.4.linear1.bias", "transformer.decoder.layers.4.linear2.weight", "transformer.decoder.layers.4.linear2.bias", "transformer.decoder.layers.4.norm3.weight", "transformer.decoder.layers.4.norm3.bias", "transformer.decoder.layers.5.cross_attn.sampling_offsets.weight", "transformer.decoder.layers.5.cross_attn.sampling_offsets.bias", "transformer.decoder.layers.5.cross_attn.attention_weights.weight", "transformer.decoder.layers.5.cross_attn.attention_weights.bias", "transformer.decoder.layers.5.cross_attn.value_proj.weight", "transformer.decoder.layers.5.cross_attn.value_proj.bias", "transformer.decoder.layers.5.cross_attn.output_proj.weight", "transformer.decoder.layers.5.cross_attn.output_proj.bias", "transformer.decoder.layers.5.norm1.weight", "transformer.decoder.layers.5.norm1.bias", "transformer.decoder.layers.5.self_attn.in_proj_weight", "transformer.decoder.layers.5.self_attn.in_proj_bias", "transformer.decoder.layers.5.self_attn.out_proj.weight", 
"transformer.decoder.layers.5.self_attn.out_proj.bias", "transformer.decoder.layers.5.norm2.weight", "transformer.decoder.layers.5.norm2.bias", "transformer.decoder.layers.5.linear1.weight", "transformer.decoder.layers.5.linear1.bias", "transformer.decoder.layers.5.linear2.weight", "transformer.decoder.layers.5.linear2.bias", "transformer.decoder.layers.5.norm3.weight", "transformer.decoder.layers.5.norm3.bias", "transformer.decoder.bbox_embed.0.layers.0.weight", "transformer.decoder.bbox_embed.0.layers.0.bias", "transformer.decoder.bbox_embed.0.layers.1.weight", "transformer.decoder.bbox_embed.0.layers.1.bias", "transformer.decoder.bbox_embed.0.layers.2.weight", "transformer.decoder.bbox_embed.0.layers.2.bias", "transformer.decoder.bbox_embed.1.layers.0.weight", "transformer.decoder.bbox_embed.1.layers.0.bias", "transformer.decoder.bbox_embed.1.layers.1.weight", "transformer.decoder.bbox_embed.1.layers.1.bias", "transformer.decoder.bbox_embed.1.layers.2.weight", "transformer.decoder.bbox_embed.1.layers.2.bias", "transformer.decoder.bbox_embed.2.layers.0.weight", "transformer.decoder.bbox_embed.2.layers.0.bias", "transformer.decoder.bbox_embed.2.layers.1.weight", "transformer.decoder.bbox_embed.2.layers.1.bias", "transformer.decoder.bbox_embed.2.layers.2.weight", "transformer.decoder.bbox_embed.2.layers.2.bias", "transformer.decoder.bbox_embed.3.layers.0.weight", "transformer.decoder.bbox_embed.3.layers.0.bias", "transformer.decoder.bbox_embed.3.layers.1.weight", "transformer.decoder.bbox_embed.3.layers.1.bias", "transformer.decoder.bbox_embed.3.layers.2.weight", "transformer.decoder.bbox_embed.3.layers.2.bias", "transformer.decoder.bbox_embed.4.layers.0.weight", "transformer.decoder.bbox_embed.4.layers.0.bias", "transformer.decoder.bbox_embed.4.layers.1.weight", "transformer.decoder.bbox_embed.4.layers.1.bias", "transformer.decoder.bbox_embed.4.layers.2.weight", "transformer.decoder.bbox_embed.4.layers.2.bias", "transformer.decoder.bbox_embed.5.layers.0.weight", 
"transformer.decoder.bbox_embed.5.layers.0.bias", "transformer.decoder.bbox_embed.5.layers.1.weight", "transformer.decoder.bbox_embed.5.layers.1.bias", "transformer.decoder.bbox_embed.5.layers.2.weight", "transformer.decoder.bbox_embed.5.layers.2.bias", "transformer.decoder.bbox_embed.6.layers.0.weight", "transformer.decoder.bbox_embed.6.layers.0.bias", "transformer.decoder.bbox_embed.6.layers.1.weight", "transformer.decoder.bbox_embed.6.layers.1.bias", "transformer.decoder.bbox_embed.6.layers.2.weight", "transformer.decoder.bbox_embed.6.layers.2.bias", "transformer.decoder.class_embed.0.weight", "transformer.decoder.class_embed.0.bias", "transformer.decoder.class_embed.1.weight", "transformer.decoder.class_embed.1.bias", "transformer.decoder.class_embed.2.weight", "transformer.decoder.class_embed.2.bias", "transformer.decoder.class_embed.3.weight", "transformer.decoder.class_embed.3.bias", "transformer.decoder.class_embed.4.weight", "transformer.decoder.class_embed.4.bias", "transformer.decoder.class_embed.5.weight", "transformer.decoder.class_embed.5.bias", "transformer.decoder.class_embed.6.weight", "transformer.decoder.class_embed.6.bias", "transformer.enc_output.weight", "transformer.enc_output.bias", "transformer.enc_output_norm.weight", "transformer.enc_output_norm.bias", "transformer.pos_trans.weight", "transformer.pos_trans.bias", "transformer.pos_trans_norm.weight", "transformer.pos_trans_norm.bias", "class_embed.0.weight", "class_embed.0.bias", "class_embed.1.weight", "class_embed.1.bias", "class_embed.2.weight", "class_embed.2.bias", "class_embed.3.weight", "class_embed.3.bias", "class_embed.4.weight", "class_embed.4.bias", "class_embed.5.weight", "class_embed.5.bias", "class_embed.6.weight", "class_embed.6.bias", "bbox_embed.0.layers.0.weight", "bbox_embed.0.layers.0.bias", "bbox_embed.0.layers.1.weight", "bbox_embed.0.layers.1.bias", "bbox_embed.0.layers.2.weight", "bbox_embed.0.layers.2.bias", "bbox_embed.1.layers.0.weight", 
"bbox_embed.1.layers.0.bias", "bbox_embed.1.layers.1.weight", "bbox_embed.1.layers.1.bias", "bbox_embed.1.layers.2.weight", "bbox_embed.1.layers.2.bias", "bbox_embed.2.layers.0.weight", "bbox_embed.2.layers.0.bias", "bbox_embed.2.layers.1.weight", "bbox_embed.2.layers.1.bias", "bbox_embed.2.layers.2.weight", "bbox_embed.2.layers.2.bias", "bbox_embed.3.layers.0.weight", "bbox_embed.3.layers.0.bias", "bbox_embed.3.layers.1.weight", "bbox_embed.3.layers.1.bias", "bbox_embed.3.layers.2.weight", "bbox_embed.3.layers.2.bias", "bbox_embed.4.layers.0.weight", "bbox_embed.4.layers.0.bias", "bbox_embed.4.layers.1.weight", "bbox_embed.4.layers.1.bias", "bbox_embed.4.layers.2.weight", "bbox_embed.4.layers.2.bias", "bbox_embed.5.layers.0.weight", "bbox_embed.5.layers.0.bias", "bbox_embed.5.layers.1.weight", "bbox_embed.5.layers.1.bias", "bbox_embed.5.layers.2.weight", "bbox_embed.5.layers.2.bias", "bbox_embed.6.layers.0.weight", "bbox_embed.6.layers.0.bias", "bbox_embed.6.layers.1.weight", "bbox_embed.6.layers.1.bias", "bbox_embed.6.layers.2.weight", "bbox_embed.6.layers.2.bias", "query_embed.weight", "input_proj.0.0.weight", "input_proj.0.0.bias", "input_proj.0.1.weight", "input_proj.0.1.bias", "input_proj.1.0.weight", "input_proj.1.0.bias", "input_proj.1.1.weight", "input_proj.1.1.bias", "input_proj.2.0.weight", "input_proj.2.0.bias", "input_proj.2.1.weight", "input_proj.2.1.bias", "input_proj.3.0.weight", "input_proj.3.0.bias", "input_proj.3.1.weight", "input_proj.3.1.bias", "backbone.0.body.patch_embed.proj.weight", "backbone.0.body.patch_embed.proj.bias", "backbone.0.body.patch_embed.norm.weight", "backbone.0.body.patch_embed.norm.bias", "backbone.0.body.layers.0.blocks.0.norm1.weight", "backbone.0.body.layers.0.blocks.0.norm1.bias", "backbone.0.body.layers.0.blocks.0.attn.relative_position_bias_table", "backbone.0.body.layers.0.blocks.0.attn.relative_position_index", "backbone.0.body.layers.0.blocks.0.attn.qkv.weight", "backbone.0.body.layers.0.blocks.0.attn.qkv.bias", 
"backbone.0.body.layers.0.blocks.0.attn.proj.weight", "backbone.0.body.layers.0.blocks.0.attn.proj.bias", "backbone.0.body.layers.0.blocks.0.norm2.weight", "backbone.0.body.layers.0.blocks.0.norm2.bias", "backbone.0.body.layers.0.blocks.0.mlp.fc1.weight", "backbone.0.body.layers.0.blocks.0.mlp.fc1.bias", "backbone.0.body.layers.0.blocks.0.mlp.fc2.weight", "backbone.0.body.layers.0.blocks.0.mlp.fc2.bias", "backbone.0.body.layers.0.blocks.1.norm1.weight", "backbone.0.body.layers.0.blocks.1.norm1.bias", "backbone.0.body.layers.0.blocks.1.attn.relative_position_bias_table", "backbone.0.body.layers.0.blocks.1.attn.relative_position_index", "backbone.0.body.layers.0.blocks.1.attn.qkv.weight", "backbone.0.body.layers.0.blocks.1.attn.qkv.bias", "backbone.0.body.layers.0.blocks.1.attn.proj.weight", "backbone.0.body.layers.0.blocks.1.attn.proj.bias", "backbone.0.body.layers.0.blocks.1.norm2.weight", "backbone.0.body.layers.0.blocks.1.norm2.bias", "backbone.0.body.layers.0.blocks.1.mlp.fc1.weight", "backbone.0.body.layers.0.blocks.1.mlp.fc1.bias", "backbone.0.body.layers.0.blocks.1.mlp.fc2.weight", "backbone.0.body.layers.0.blocks.1.mlp.fc2.bias", "backbone.0.body.layers.0.downsample.reduction.weight", "backbone.0.body.layers.0.downsample.norm.weight", "backbone.0.body.layers.0.downsample.norm.bias", "backbone.0.body.layers.1.blocks.0.norm1.weight", "backbone.0.body.layers.1.blocks.0.norm1.bias", "backbone.0.body.layers.1.blocks.0.attn.relative_position_bias_table", "backbone.0.body.layers.1.blocks.0.attn.relative_position_index", "backbone.0.body.layers.1.blocks.0.attn.qkv.weight", "backbone.0.body.layers.1.blocks.0.attn.qkv.bias", "backbone.0.body.layers.1.blocks.0.attn.proj.weight", "backbone.0.body.layers.1.blocks.0.attn.proj.bias", "backbone.0.body.layers.1.blocks.0.norm2.weight", "backbone.0.body.layers.1.blocks.0.norm2.bias", "backbone.0.body.layers.1.blocks.0.mlp.fc1.weight", "backbone.0.body.layers.1.blocks.0.mlp.fc1.bias", 
"backbone.0.body.layers.1.blocks.0.mlp.fc2.weight", "backbone.0.body.layers.1.blocks.0.mlp.fc2.bias", "backbone.0.body.layers.1.blocks.1.norm1.weight", "backbone.0.body.layers.1.blocks.1.norm1.bias", "backbone.0.body.layers.1.blocks.1.attn.relative_position_bias_table", "backbone.0.body.layers.1.blocks.1.attn.relative_position_index", "backbone.0.body.layers.1.blocks.1.attn.qkv.weight", "backbone.0.body.layers.1.blocks.1.attn.qkv.bias", "backbone.0.body.layers.1.blocks.1.attn.proj.weight", "backbone.0.body.layers.1.blocks.1.attn.proj.bias", "backbone.0.body.layers.1.blocks.1.norm2.weight", "backbone.0.body.layers.1.blocks.1.norm2.bias", "backbone.0.body.layers.1.blocks.1.mlp.fc1.weight", "backbone.0.body.layers.1.blocks.1.mlp.fc1.bias", "backbone.0.body.layers.1.blocks.1.mlp.fc2.weight", "backbone.0.body.layers.1.blocks.1.mlp.fc2.bias", "backbone.0.body.layers.1.downsample.reduction.weight", "backbone.0.body.layers.1.downsample.norm.weight", "backbone.0.body.layers.1.downsample.norm.bias", "backbone.0.body.layers.2.blocks.0.norm1.weight", "backbone.0.body.layers.2.blocks.0.norm1.bias", "backbone.0.body.layers.2.blocks.0.attn.relative_position_bias_table", "backbone.0.body.layers.2.blocks.0.attn.relative_position_index", "backbone.0.body.layers.2.blocks.0.attn.qkv.weight", "backbone.0.body.layers.2.blocks.0.attn.qkv.bias", "backbone.0.body.layers.2.blocks.0.attn.proj.weight", "backbone.0.body.layers.2.blocks.0.attn.proj.bias", "backbone.0.body.layers.2.blocks.0.norm2.weight", "backbone.0.body.layers.2.blocks.0.norm2.bias", "backbone.0.body.layers.2.blocks.0.mlp.fc1.weight", "backbone.0.body.layers.2.blocks.0.mlp.fc1.bias", "backbone.0.body.layers.2.blocks.0.mlp.fc2.weight", "backbone.0.body.layers.2.blocks.0.mlp.fc2.bias", "backbone.0.body.layers.2.blocks.1.norm1.weight", "backbone.0.body.layers.2.blocks.1.norm1.bias", "backbone.0.body.layers.2.blocks.1.attn.relative_position_bias_table", "backbone.0.body.layers.2.blocks.1.attn.relative_position_index", 
"backbone.0.body.layers.2.blocks.1.attn.qkv.weight", "backbone.0.body.layers.2.blocks.1.attn.qkv.bias", "backbone.0.body.layers.2.blocks.1.attn.proj.weight", "backbone.0.body.layers.2.blocks.1.attn.proj.bias", "backbone.0.body.layers.2.blocks.1.norm2.weight", "backbone.0.body.layers.2.blocks.1.norm2.bias", "backbone.0.body.layers.2.blocks.1.mlp.fc1.weight", "backbone.0.body.layers.2.blocks.1.mlp.fc1.bias", "backbone.0.body.layers.2.blocks.1.mlp.fc2.weight", "backbone.0.body.layers.2.blocks.1.mlp.fc2.bias", "backbone.0.body.layers.2.blocks.2.norm1.weight", "backbone.0.body.layers.2.blocks.2.norm1.bias", "backbone.0.body.layers.2.blocks.2.attn.relative_position_bias_table", "backbone.0.body.layers.2.blocks.2.attn.relative_position_index", "backbone.0.body.layers.2.blocks.2.attn.qkv.weight", "backbone.0.body.layers.2.blocks.2.attn.qkv.bias", "backbone.0.body.layers.2.blocks.2.attn.proj.weight", "backbone.0.body.layers.2.blocks.2.attn.proj.bias", "backbone.0.body.layers.2.blocks.2.norm2.weight", "backbone.0.body.layers.2.blocks.2.norm2.bias", "backbone.0.body.layers.2.blocks.2.mlp.fc1.weight", "backbone.0.body.layers.2.blocks.2.mlp.fc1.bias", "backbone.0.body.layers.2.blocks.2.mlp.fc2.weight", "backbone.0.body.layers.2.blocks.2.mlp.fc2.bias", "backbone.0.body.layers.2.blocks.3.norm1.weight", "backbone.0.body.layers.2.blocks.3.norm1.bias", "backbone.0.body.layers.2.blocks.3.attn.relative_position_bias_table", "backbone.0.body.layers.2.blocks.3.attn.relative_position_index", "backbone.0.body.layers.2.blocks.3.attn.qkv.weight", "backbone.0.body.layers.2.blocks.3.attn.qkv.bias", "backbone.0.body.layers.2.blocks.3.attn.proj.weight", "backbone.0.body.layers.2.blocks.3.attn.proj.bias", "backbone.0.body.layers.2.blocks.3.norm2.weight", "backbone.0.body.layers.2.blocks.3.norm2.bias", "backbone.0.body.layers.2.blocks.3.mlp.fc1.weight", "backbone.0.body.layers.2.blocks.3.mlp.fc1.bias", "backbone.0.body.layers.2.blocks.3.mlp.fc2.weight", 
"backbone.0.body.layers.2.blocks.3.mlp.fc2.bias", "backbone.0.body.layers.2.blocks.4.norm1.weight", "backbone.0.body.layers.2.blocks.4.norm1.bias", "backbone.0.body.layers.2.blocks.4.attn.relative_position_bias_table", "backbone.0.body.layers.2.blocks.4.attn.relative_position_index", "backbone.0.body.layers.2.blocks.4.attn.qkv.weight", "backbone.0.body.layers.2.blocks.4.attn.qkv.bias",
(...)
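The missing/unexpected keys above suggest a raw detector checkpoint (keys starting with `transformer.`, `backbone.`, etc.) was loaded into the HOI model, which stores the same weights under a `detector.` prefix. A minimal sketch of remapping such a checkpoint (the function name is hypothetical; adapt the prefix to the actual model attribute):

```python
# Hedged sketch: remap a raw detector state_dict so every key carries the
# "detector." prefix that the wrapping HOI model expects.
from collections import OrderedDict

def prefix_detector_keys(state_dict, prefix="detector."):
    """Return a copy of state_dict with every key prefixed."""
    return OrderedDict((prefix + key, value) for key, value in state_dict.items())
```

One would then load with `model.load_state_dict(prefix_detector_keys(ckpt), strict=False)` so that heads absent from the detector checkpoint do not block loading.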

hyperparameters

Hi,
Great work. We found that the hyperparameters reported in your paper are slightly different from the settings in the code. Which settings should we follow for training?

A problem regarding inference and visualization

Hi Sir,
Thanks for your great work!
When I run the inference code, I encounter the following issue. Could you please give me any advice?

=> Start from a randomly initialised model
Traceback (most recent call last):
File "inference.py", line 164, in <module>
main(args)
File "/home/user/miniconda/envs/torch/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "inference.py", line 121, in main
image, output[0], attn_weights[0],
IndexError: list index out of range
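The log shows inference started from a randomly initialised model, so the post-processed detection list can be empty and indexing `output[0]` raises `IndexError`. Besides passing trained weights via the checkpoint argument, a defensive guard avoids the crash; a sketch with a hypothetical helper name, where `outputs` stands in for the per-image list of detections:

```python
# Hedged sketch: guard the indexing that fails above. The output list can
# be empty when the model is untrained or the score threshold filters
# every detection out.
def first_detection(outputs):
    """Return the first per-image output, or None if there are none."""
    return outputs[0] if outputs else None
```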

NaN when I use a DETR that I fine tuned

Hello, I have a problem when trying to use PViC with a modified DETR. I fine-tuned the DETR from its own repository. Running with the base DETR works, but with the fine-tuned DETR weights, running as follows:

[screenshot of the command]

It ends up failing and giving this error:

[screenshot of the error]

My problem seems similar to the last two comments on this issue here.

Do you have any suggestions on how I can solve this issue?
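When a fine-tuned DETR checkpoint leads to NaN losses, it is worth verifying that the checkpoint itself is finite before suspecting the HOI head. A sketch of such a sanity check, operating on an already-loaded state_dict:

```python
# Hedged sketch: list floating-point parameters in a state_dict that
# contain NaN or Inf values; a non-empty result means the fine-tuned
# checkpoint is already corrupted before PViC touches it.
import torch

def find_bad_params(state_dict):
    """Return the names of parameters containing NaN/Inf entries."""
    return [name for name, tensor in state_dict.items()
            if torch.is_floating_point(tensor)
            and not torch.isfinite(tensor).all()]
```

If this returns an empty list, the next things to check are the key layout of the checkpoint and the learning rate used for fine-tuning.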

some question about hyperparameters of PViC with resnet50

Thank you for your excellent work!!
When using your code on hico-det, the mAP with the swin-large backbone is normal, but with the resnet50 backbone, the mAP is only 33.6. I'm wondering if this could be due to the hyperparameters? I cloned your repository quite early; have you made any updates since then? Thank you again for your great work.

Get grad of inputs

Hello, I respectfully appreciate the work you have done. I encountered the following issue while trying to get the gradients of the inputs, and I would greatly appreciate your help in solving it.

In the function `_on_each_iteration(self)` in `util.py`:

self._state.loss = sum(loss for loss in loss_dict.values())
self._state.optimizer.zero_grad(set_to_none=True)
self._state.loss.backward()

# To get inputs.grad
grad = self._state.inputs[0][0].grad

but `grad` is `None`. I want to know how to get the correct gradient, and I would greatly appreciate your help in solving it.
@fredzzhang
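In general, `.grad` is populated only for leaf tensors with `requires_grad=True`; tensors coming out of a data loader have `requires_grad=False`, so their `.grad` stays `None` after `backward()`. A minimal sketch of the fix, using a stand-in tensor for the input image:

```python
# Hedged sketch: enable gradients on the input before the forward pass so
# inputs.grad is populated by backward(). For a non-leaf tensor one would
# additionally call .retain_grad().
import torch

x = torch.randn(3, 32, 32)   # stand-in for one input image tensor
x.requires_grad_(True)       # mark the leaf tensor as requiring grad
loss = (x * 2.0).sum()       # stand-in for the model forward + loss
loss.backward()
# x.grad now holds d(loss)/dx, which is 2 everywhere for this toy loss
```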

Random seeds in training

Hi,
We found that fixing the random seed in the PViC code reproduces the first evaluation result, but subsequent training results still vary between runs. Here is what happens when the same code is run twice.

Namespace(alpha=0.5, aux_loss=True, backbone='resnet101', batch_size=16, bbox_loss_coef=5, box_score_thresh=0.05, cache=False, clip_max_norm=0.1, data_root='./hicodet', dataset='hicodet', dec_layers=6, detector='base', device='cuda', dilation=False, dim_feedforward=2048, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=30, eval=False, gamma=0.1, giou_loss_coef=2, hidden_dim=256, kv_src='C5', lr_drop=20, lr_drop_factor=0.2, lr_head=0.0001, max_instances=15, min_instances=3, nheads=8, num_queries=100, num_workers=2, output_dir='outputs/dub', partitions=['train2015', 'test2015'], port='1234', position_embedding='sine', pre_norm=False, pretrained='checkpoints/detr-r101-hicodet.pth', print_interval=100, raw_lambda=2.8, repr_dim=384, resume='', sanity=False, seed=140, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, triplet_dec_layers=2, triplet_enc_layers=1, use_wandb=False, weight_decay=0.0001, world_size=2)
Rank 1: Load weights for the object detector from checkpoints/detr-r101-hicodet.pth
=> Rank 1: PViC randomly initialised.
Rank 0: Load weights for the object detector from checkpoints/detr-r101-hicodet.pth
=> Rank 0: PViC randomly initialised.
Epoch 0 =>	mAP: 0.1483, rare: 0.1011, none-rare: 0.1624.
Epoch [1/30], Iter. [0100/2352], Loss: 3.8620, Time[Data/Iter.]: [6.48s/200.77s]
Epoch [1/30], Iter. [0200/2352], Loss: 2.2175, Time[Data/Iter.]: [0.12s/198.64s]
Epoch [1/30], Iter. [0300/2352], Loss: 2.1058, Time[Data/Iter.]: [0.12s/197.65s]
Epoch [1/30], Iter. [0400/2352], Loss: 1.9482, Time[Data/Iter.]: [0.12s/200.41s]
Epoch [1/30], Iter. [0500/2352], Loss: 1.8276, Time[Data/Iter.]: [0.12s/199.10s]
Epoch [1/30], Iter. [0600/2352], Loss: 1.7830, Time[Data/Iter.]: [0.12s/195.86s]
Epoch [1/30], Iter. [0700/2352], Loss: 1.7758, Time[Data/Iter.]: [0.12s/193.67s]
Epoch [1/30], Iter. [0800/2352], Loss: 1.7299, Time[Data/Iter.]: [0.12s/197.73s]
Epoch [1/30], Iter. [0900/2352], Loss: 1.6942, Time[Data/Iter.]: [0.12s/199.21s]
Epoch [1/30], Iter. [1000/2352], Loss: 1.6837, Time[Data/Iter.]: [0.12s/195.18s]
Epoch [1/30], Iter. [1100/2352], Loss: 1.6410, Time[Data/Iter.]: [0.12s/198.79s]
Epoch [1/30], Iter. [1200/2352], Loss: 1.6846, Time[Data/Iter.]: [0.12s/193.91s]
Epoch [1/30], Iter. [1300/2352], Loss: 1.6586, Time[Data/Iter.]: [0.12s/197.35s]
Epoch [1/30], Iter. [1400/2352], Loss: 1.6119, Time[Data/Iter.]: [0.12s/199.18s]
Epoch [1/30], Iter. [1500/2352], Loss: 1.6100, Time[Data/Iter.]: [0.12s/195.75s]
Epoch [1/30], Iter. [1600/2352], Loss: 1.6113, Time[Data/Iter.]: [0.12s/197.37s]
Epoch [1/30], Iter. [1700/2352], Loss: 1.5859, Time[Data/Iter.]: [0.12s/194.13s]
Epoch [1/30], Iter. [1800/2352], Loss: 1.6008, Time[Data/Iter.]: [0.12s/198.77s]
Epoch [1/30], Iter. [1900/2352], Loss: 1.5268, Time[Data/Iter.]: [0.12s/195.72s]
Epoch [1/30], Iter. [2000/2352], Loss: 1.5740, Time[Data/Iter.]: [0.12s/196.66s]
Epoch [1/30], Iter. [2100/2352], Loss: 1.5322, Time[Data/Iter.]: [0.12s/199.11s]
Epoch [1/30], Iter. [2200/2352], Loss: 1.5152, Time[Data/Iter.]: [0.12s/199.33s]
Epoch [1/30], Iter. [2300/2352], Loss: 1.5533, Time[Data/Iter.]: [0.12s/196.02s]
Epoch 1 =>	mAP: 0.3168, rare: 0.3048, none-rare: 0.3204.
Namespace(alpha=0.5, aux_loss=True, backbone='resnet101', batch_size=16, bbox_loss_coef=5, box_score_thresh=0.05, cache=False, clip_max_norm=0.1, data_root='./hicodet', dataset='hicodet', dec_layers=6, detector='base', device='cuda', dilation=False, dim_feedforward=2048, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=30, eval=False, gamma=0.1, giou_loss_coef=2, hidden_dim=256, kv_src='C5', lr_drop=20, lr_drop_factor=0.2, lr_head=0.0001, max_instances=15, min_instances=3, nheads=8, num_queries=100, num_workers=2, output_dir='outputs/dub', partitions=['train2015', 'test2015'], port='1234', position_embedding='sine', pre_norm=False, pretrained='checkpoints/detr-r101-hicodet.pth', print_interval=100, raw_lambda=2.8, repr_dim=384, resume='', sanity=False, seed=140, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, triplet_dec_layers=2, triplet_enc_layers=1, use_wandb=False, weight_decay=0.0001, world_size=2)
Rank 1: Load weights for the object detector from checkpoints/detr-r101-hicodet.pth
=> Rank 1: PViC randomly initialised.
Rank 0: Load weights for the object detector from checkpoints/detr-r101-hicodet.pth
=> Rank 0: PViC randomly initialised.
Epoch 0 =>	mAP: 0.1483, rare: 0.1011, none-rare: 0.1624.
Epoch [1/30], Iter. [0100/2352], Loss: 3.8620, Time[Data/Iter.]: [6.28s/201.51s]
Epoch [1/30], Iter. [0200/2352], Loss: 2.2173, Time[Data/Iter.]: [0.12s/199.69s]
Epoch [1/30], Iter. [0300/2352], Loss: 2.1066, Time[Data/Iter.]: [0.12s/199.08s]
Epoch [1/30], Iter. [0400/2352], Loss: 1.9485, Time[Data/Iter.]: [0.12s/201.27s]
Epoch [1/30], Iter. [0500/2352], Loss: 1.8270, Time[Data/Iter.]: [0.13s/199.80s]
Epoch [1/30], Iter. [0600/2352], Loss: 1.7837, Time[Data/Iter.]: [0.12s/196.83s]
Epoch [1/30], Iter. [0700/2352], Loss: 1.7743, Time[Data/Iter.]: [0.13s/194.57s]
Epoch [1/30], Iter. [0800/2352], Loss: 1.7293, Time[Data/Iter.]: [0.12s/198.28s]
Epoch [1/30], Iter. [0900/2352], Loss: 1.6914, Time[Data/Iter.]: [0.12s/200.15s]
Epoch [1/30], Iter. [1000/2352], Loss: 1.6790, Time[Data/Iter.]: [0.12s/196.27s]
Epoch [1/30], Iter. [1100/2352], Loss: 1.6371, Time[Data/Iter.]: [0.12s/199.65s]
Epoch [1/30], Iter. [1200/2352], Loss: 1.6850, Time[Data/Iter.]: [0.12s/194.97s]
Epoch [1/30], Iter. [1300/2352], Loss: 1.6531, Time[Data/Iter.]: [0.12s/198.41s]
Epoch [1/30], Iter. [1400/2352], Loss: 1.6091, Time[Data/Iter.]: [0.12s/200.63s]
Epoch [1/30], Iter. [1500/2352], Loss: 1.6101, Time[Data/Iter.]: [0.12s/196.76s]
Epoch [1/30], Iter. [1600/2352], Loss: 1.6109, Time[Data/Iter.]: [0.12s/198.47s]
Epoch [1/30], Iter. [1700/2352], Loss: 1.5908, Time[Data/Iter.]: [0.12s/195.29s]
Epoch [1/30], Iter. [1800/2352], Loss: 1.6023, Time[Data/Iter.]: [0.12s/199.69s]
Epoch [1/30], Iter. [1900/2352], Loss: 1.5266, Time[Data/Iter.]: [0.12s/196.88s]
Epoch [1/30], Iter. [2000/2352], Loss: 1.5728, Time[Data/Iter.]: [0.13s/198.12s]
Epoch [1/30], Iter. [2100/2352], Loss: 1.5333, Time[Data/Iter.]: [0.12s/200.25s]
Epoch [1/30], Iter. [2200/2352], Loss: 1.5144, Time[Data/Iter.]: [0.12s/200.74s]
Epoch [1/30], Iter. [2300/2352], Loss: 1.5541, Time[Data/Iter.]: [0.12s/197.17s]
Epoch 1 =>	mAP: 0.3156, rare: 0.3004, none-rare: 0.3201.
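For reference, fixing every RNG the training loop touches looks roughly like the sketch below. Even with identical seeds, some CUDA kernels are nondeterministic unless the cuDNN flags are set, which is a common source of small run-to-run differences like the ones above. The seed_everything helper name is ours, not PViC's:

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 140) -> None:
    # Seed every RNG the training loop may touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds CUDA RNGs on all devices
    # Even with fixed seeds, some CUDA kernels are nondeterministic;
    # these flags trade speed for reproducibility.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(140)
a = torch.rand(3)
seed_everything(140)
b = torch.rand(3)
print(torch.equal(a, b))  # True: identical seeds give identical draws
```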

About pre-trained weight of object detector

Thanks for your great work! I have a question about the pre-trained object detector weights used when training PViC-H-Defm-DETR-SwinL on HICO-DET. For the H-DETR weights h-defm-detr-swinL-dp0-mqs-lft-iter-2stg-hicodet.pth, how should I choose from the model zoo at https://github.com/HDETR/H-Deformable-DETR? There are many models with a Swin-Large backbone; can I just pick one of them at random? Thanks!

What is the module "upt" in inference.py?

I tried to run inference.py to test the model and got the error ModuleNotFoundError: No module named 'upt'. After running pip install upt and executing inference.py again, the program printed: ImportError: cannot import name 'build_detector' from 'upt' (/opt/conda/envs/pvic/lib/python3.8/site-packages/upt/__init__.py).

I think I have misunderstood where upt comes from. I found a UPT model in UPT: Unary–Pairwise Transformers. Is this the right source? If so, how do I combine it with your project?

PViC Fine-tune

Hello, thank you for the amazing work with the PViC.

Could you help me? I've trained PViC on my own images and it works well, but I would now like to fine-tune this model on new images. Is that possible?

(Another question: is there a reason the test partition is used during training? -> ap = self.test_hico())
Thank you!
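Fine-tuning from a previous run generally amounts to reloading the saved state dict and continuing training, e.g. via the --resume argument visible in the logs elsewhere in this thread. A generic PyTorch sketch, where the checkpoint key "model_state_dict" and the toy linear model are assumptions, not necessarily what PViC saves:

```python
import os
import tempfile
import torch

# Stand-in model and optimiser (a lower LR is typical for fine-tuning).
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Pretend a previous run saved this checkpoint; the key name is assumed.
ckpt_path = os.path.join(tempfile.gettempdir(), "pvic_finetune_sketch.pth")
torch.save({"model_state_dict": model.state_dict()}, ckpt_path)

# Reload the weights and continue training on the new images as usual.
checkpoint = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
```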

The loss is NaN during training

Hi, thanks for your great work. I am using the DETR model (from UPT) to train on one GPU with a batch size of 16. The training loss becomes NaN, and the problem persists after trying different random seeds. Can you give me any advice?

Namespace(alpha=0.5, aux_loss=True, backbone='resnet50', batch_size=16,  detector='base', epochs=30, world_size=1)
Rank 0: Load weights for the object detector from checkpoints/detr-r50-hicodet.pth
=> Rank 0: PViC randomly initialised.

Epoch 0 =>	mAP: 0.1359, rare: 0.0912, none-rare: 0.1493.

Epoch [1/30], Iter. [0100/2353], Loss: 3.9044, Time[Data/Iter.]: [10.76s/153.78s]
...
Epoch [1/30], Iter. [2300/2353], Loss: 1.5333, Time[Data/Iter.]: [0.63s/147.35s]
...
100%|██████████| 597/597 [19:03<00:00,  1.92s/it]
Epoch 1 =>	mAP: 0.0127, rare: 0.0110, none-rare: 0.0132.

Traceback (most recent call last):
  File "main.py", line 195, in <module>
    mp.spawn(main, nprocs=args.world_size, args=(args,))
  File "/usr/local/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/quan/pvic/main.py", line 130, in main
    engine(args.epochs)
  File "/home/quan/pvic/pocket/core/distributed.py", line 139, in __call__
    self._on_each_iteration()
  File "/home/quan/pvic/utils.py", line 195, in _on_each_iteration
    raise ValueError(f"The HOI loss is NaN for rank {self._rank}")
ValueError: The HOI loss is NaN for rank 0
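As a general note (not specific to PViC), two standard diagnostics for a NaN loss are autograd anomaly detection, which pinpoints the operation that first produced a NaN/inf, and guarding the optimiser step. A toy sketch with a stand-in model:

```python
import torch

# 1) Anomaly detection: backward() will raise at the op that produced NaN/inf.
#    It is slow, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

# 2) Guard the update: skip the step (or lower the LR) when the loss is not
#    finite, and clip gradients to bound the update magnitude.
if torch.isfinite(loss):
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    optimizer.step()

print(torch.isfinite(loss).item())  # True for this well-behaved toy batch
```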

RuntimeError: CUDA error: invalid device ordinal

Hello, thanks for your pretty code. When I run the following for training:

! DETR=base python main.py --pretrained checkpoints/detr-r50-hicodet.pth --output-dir outputs/pvic-detr-r50-hicodet

I get the following error:
/content/drive/MyDrive/pvic
Namespace(backbone='resnet50', dilation=False, position_embedding='sine', hidden_dim=256, enc_layers=6, dec_layers=6, dim_feedforward=2048, dropout=0.1, nheads=8, num_queries=100, pre_norm=False, lr_head=0.0001, lr_drop=20, lr_drop_factor=0.2, epochs=30, batch_size=16, weight_decay=0.0001, clip_max_norm=0.1, aux_loss=True, set_cost_class=1, set_cost_bbox=5, set_cost_giou=2, bbox_loss_coef=5, giou_loss_coef=2, eos_coef=0.1, device='cuda', dataset='hicodet', partitions=['train2015', 'test2015'], num_workers=2, data_root='./hicodet', output_dir='outputs/pvic-detr-r50-hicodet', pretrained='checkpoints/detr-r50-hicodet.pth', print_interval=100, detector='base', raw_lambda=2.8, kv_src='C5', repr_dim=384, triplet_enc_layers=1, triplet_dec_layers=2, alpha=0.5, gamma=0.1, box_score_thresh=0.05, min_instances=3, max_instances=15, resume='', use_wandb=False, port='1234', seed=140, world_size=8, eval=False, cache=False, sanity=False)
[W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:1234 (errno: 99 - Cannot assign requested address).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:1234 (errno: 99 - Cannot assign requested address).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:1234 (errno: 99 - Cannot assign requested address).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:1234 (errno: 99 - Cannot assign requested address).
Traceback (most recent call last):
  File "/content/drive/MyDrive/pvic/main.py", line 193, in <module>
    mp.spawn(main, nprocs=args.world_size, args=(args,))
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 158, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/content/drive/MyDrive/pvic/main.py", line 43, in main
    torch.cuda.set_device(rank)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
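The error itself follows from world_size=8 in the Namespace above while a Colab runtime exposes a single GPU: ranks 1-7 call torch.cuda.set_device(rank) for devices that do not exist. A minimal sketch of clamping the world size to the visible device count before spawning:

```python
import torch

# "invalid device ordinal" means a spawned rank asked for a GPU index that
# is >= the number of visible devices. Clamp world_size before mp.spawn.
requested_world_size = 8               # e.g. args.world_size
available = torch.cuda.device_count()  # 1 on a single-GPU Colab box, 0 on CPU

world_size = min(requested_world_size, max(available, 1))
print(world_size)  # the number of processes it is safe to spawn
```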

Experimental results on V-COCO

Hello, when will your training and evaluation code for the V-COCO dataset be released? My recent work depends heavily on your code, and I would be very grateful if you could update it!

Timeline for updating README.md

I am puzzled by the difference between my reproduced results on V-COCO and the results reported in the paper. I used 2 RTX 4090 GPUs with a batch size of 32 for reproduction.
[screenshot: reproduced V-COCO results]

Mistake when skipping images.

Hi @fredzzhang,

We have a tricky problem.

            if len(x_keep) == 0:
                ho_queries.append(torch.zeros(0, self.repr_size, device=device))
                paired_indices.append(torch.zeros(0, 2, device=device, dtype=torch.int64))
                prior_scores.append(torch.zeros(0, 2, self.num_verbs, device=device))
                object_types.append(torch.zeros(0, device=device, dtype=torch.int64))
                positional_embeds.append({})
                continue

The absence of humans is handled here (when len(x_keep) == 0).

        q_p = self.q_attn_qpos_proj(q_pos["box"])
        k_p = self.q_attn_kpos_proj(q_pos["box"])

When x_keep==0 leaves q_pos as an empty dictionary, it results in:

    cc = self.clipdecoder(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/pvicx/transformers.py", line 905, in forward
    pair = q_pos["pair_spatial"]
KeyError: 'pair_spatial'

Here "pair_spatial" denotes the feature formed by concatenating the human and object features (from DETR). This feature hits the x_keep==0 case and raises an error. We found that your positional_embeds also encounters x_keep==0, yet your code works fine.

We tried rewriting positional_embeds.append({}) as a tensor, positional_embeds.append(torch.zeros(0, 1, self.repr_size, device=device)), but this empty tensor causes subsequent code to keep raising errors. How should we modify this part of the code so it works correctly?
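One hedged workaround, where the key names and shapes are assumptions based on the snippets above rather than PViC's actual layout, is to keep the dictionary keys consistent and append zero-length tensors instead of an empty dict, so downstream lookups like q_pos["pair_spatial"] see an empty batch rather than a missing key:

```python
import torch

# Hypothetical sketch: when no human-object pairs survive filtering, append
# a dict with the SAME keys the decoder reads ("box" and "pair_spatial" are
# assumed names) holding zero-length tensors, instead of an empty dict.
repr_size = 384  # matches repr_dim=384 in the training Namespace
device = torch.device("cpu")

positional_embeds = []
x_keep = []  # no detections survived

if len(x_keep) == 0:
    positional_embeds.append({
        # Shapes are illustrative; the leading 0 is the empty pair dimension.
        "box": torch.zeros(0, 2, repr_size, device=device),
        "pair_spatial": torch.zeros(0, repr_size, device=device),
    })

q_pos = positional_embeds[0]
print(q_pos["pair_spatial"].shape[0])  # 0: an empty batch, not a KeyError
```

With this shape convention, projections such as self.q_attn_qpos_proj(q_pos["box"]) simply produce zero-length outputs and the decoder can pass through without special-casing, assuming no later code indexes into the first element.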
