
rlipv2's Introduction

RLIPv2: Fast Scaling of Relational Language-Image Pre-training

Accepted to ICCV 2023 🥳



Abstract: Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of the RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully fine-tuning, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning, yields 32.22 mAP with just 1% of the data and yields 45.09 mAP with 100% of the data.

Todo list

Note that if you cannot access the links provided below (OneDrive is super unstable), try using another browser, raising an issue, or contacting me by e-mail. I am happy to provide any assistance.

  • 🎉 Release code for pre-training, fine-tuning and inference.
  • 🎉 Release pre-training and fine-tuning annotations.
  • 🎉 Release checkpoints for pre-training, few-shot, zero-shot and fine-tuning.

Information before using this repo

I changed all the paths to prevent possible information leakage. In order to run the code, you will need to configure the paths to match your own system. To do this, search for the "/PATH/TO" placeholder in the code and replace it with the appropriate file path on your system. ⭐⭐⭐ Consider starring the repo! ⭐⭐⭐
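
As a rough sketch of how you might locate the remaining placeholders (the set of file types to scan is an assumption; adjust it to your checkout), a small Python helper could look like this:

# Minimal sketch: list repository files that still contain the "/PATH/TO" placeholder.
# Run it from the repository root; the scanned extensions are an assumption.
import pathlib

for path in pathlib.Path(".").rglob("*"):
    if path.is_file() and path.suffix in {".py", ".sh", ".txt", ".json"}:
        if "/PATH/TO" in path.read_text(errors="ignore"):
            print(path)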

Environment setup

I recommend creating a new conda environment to run the code. Check scripts/create_environment.txt for details on how to set up the environment.

Model outline

This repo contains implementations of various methods for HOI detection (not limited to RLIP), aiming to serve as a benchmark for HOI detection. The following methods are included in this repo:

  • RLIPv2-ParSeDA (model name in the repo: RLIP_ParSeDA_v2);
  • RLIPv2-ParSeD (model name in the repo: RLIP_ParSeD_v2);
  • RLIP-ParSe (model name in the repo: RLIP-ParSe);
  • ParSe (model name in the repo: ParSe);
  • RLIP-ParSeD (model name in the repo: RLIP-ParSeD);
  • ParSeD (model name in the repo: ParSeD);
  • OCN (model name in the repo: OCN), which is a prior work of RLIP;
  • QPIC (model name in the repo: DETRHOI);
  • QAHOI (model name in the repo: DDETRHOI);
  • CDN (model name in the repo: CDN);

Citation

@inproceedings{Yuan2023RLIPv2,
  title={RLIPv2: Fast Scaling of Relational Language-Image Pre-training},
  author={Yuan, Hangjie and Zhang, Shiwei and Wang, Xiang and Albanie, Samuel and Pan, Yining and Feng, Tao and Jiang, Jianwen and Ni, Dong and Zhang, Yingya and Zhao, Deli},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2023}
}

@inproceedings{Yuan2022RLIP,
  title={RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection},
  author={Yuan, Hangjie and Jiang, Jianwen and Albanie, Samuel and Feng, Tao and Huang, Ziyuan and Ni, Dong and Tang, Mingqian},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2022}
}

@inproceedings{Yuan2022OCN,
  title={Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics},
  author={Hangjie Yuan and Mang Wang and Dong Ni and Liangpeng Xu},
  booktitle={AAAI},
  year={2022}
}

Annotation preparation

Dataset Setting Download
VG RLIP Link
COCO (pseudo) RLIP Link
Objects365 (pseudo) RLIP Link
Open Images Fully-finetuning Link
HICO-DET Few-shot 1%, 10% Link
HICO-DET Zero-shot (UC-NF, UC-RF)* Link

Note: * The Zero-shot (NF) setting does not need any HICO-DET annotations for fine-tuning, so we only provide training annotations for the UC-NF and UC-RF settings.

Pre-training datasets preparation

1. Visual Genome

First, download the VG dataset from the official link, including images Part I and Part II. (Note: if the official website is not working, you can use the links that I provide: Images and Images2.) The annotations after pre-processing can be downloaded from the link above and are used for pre-training. Note that they are generated from the scene_graphs.json file by several pre-processing steps that remove redundant triplets. Several of the settings mentioned below also need the annotations that we provide. The VG dataset and its corresponding annotations should be organized as follows:

VG
 |─ annotations
 |   |— scene_graphs_after_preprocessing.json
 |   :
 |— images
 |   |— 2409818.jpg
 |   |— n102412.jpg
 :   :
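
As a quick sanity check of the downloaded annotations, a minimal sketch (assuming the file is a JSON list of per-image records; the exact schema is not documented here) is:

# Hedged sketch: peek into the preprocessed VG scene-graph annotations.
# Assumes a JSON list of per-image records; adjust the path to your setup.
import json

with open("/PATH/TO/VG/annotations/scene_graphs_after_preprocessing.json") as f:
    scene_graphs = json.load(f)

print("number of records:", len(scene_graphs))
first = scene_graphs[0] if isinstance(scene_graphs, list) else next(iter(scene_graphs.values()))
print("fields of the first record:", list(first.keys()))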

2. COCO

First, download the COCO2017 dataset from the official link. If you want to run the R-Tagger, you also need to download the bounding box annotations from the website. If you just want to perform relational pre-training, you only need to download the pseudo-annotations for COCO2017. The dataset should be organized as follows:

COCO2017
 |— annotations
 |   |— instances_train2017.json
 |   |— instances_val2017.json
 |   └─ RLIPv2_train2017_threshold20.....json
 |   
 |— train2017
 |   |— 000000498666.jpg
 |   :
 |
 |— val2017
 |   |— 000000414261.jpg
 :   :
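
To verify the official COCO2017 part of this layout, the standard pycocotools API can be used (a minimal sketch; the RLIPv2 pseudo-annotation file has its own format and is not covered here):

# Sanity-check the official COCO2017 annotations with pycocotools.
from pycocotools.coco import COCO

coco = COCO("/PATH/TO/COCO2017/annotations/instances_train2017.json")
img_ids = coco.getImgIds()
print("number of images:", len(img_ids))
print("example file_name:", coco.loadImgs(img_ids[0])[0]["file_name"])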

3. Objects365

First, download the Objects365 dataset from the official link. This dataset contains 51 training patches and 44 validation patches, which together amount to more than 1,700k images used for pre-training. Similarly, if you want to run the R-Tagger, you also need to download the bounding box annotations from the website. If you just want to perform relational pre-training, you only need to download the pseudo-annotations for Objects365. (You can also try the script in the scripts/datasets folder.) The dataset should be organized as follows:

Objects365
 |— train 
 |   |— patch0
 |   |— patch1
 |   :
 |   |— patch50
 |   └─ zhiyuan_objv2_train.json
 |
 |— val
 |   |— patch0
 |   |— patch1
 |   :
 |   |— patch43
 |   └─ zhiyuan_objv2_val.json
 |
 |— rel_annotations
 |    └─ RLIPv2_o365trainval_Tagger2.....json
 |
 └─ image_id_to_filepath.json
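
A hedged sketch for checking image_id_to_filepath.json against the patch folders above (the file is assumed to map image ids to paths relative to the Objects365 root; adjust if the actual format differs):

# Hedged sketch: check that a few entries of image_id_to_filepath.json resolve to files.
# The id-to-relative-path mapping format is an assumption.
import json
import os

root = "/PATH/TO/Objects365"
with open(os.path.join(root, "image_id_to_filepath.json")) as f:
    id_to_path = json.load(f)

for image_id, rel_path in list(id_to_path.items())[:5]:
    print(image_id, rel_path, os.path.isfile(os.path.join(root, rel_path)))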

Downstream dataset preparation

1. HICO-DET

The HICO-DET dataset can be downloaded here. After downloading, unpack the tarball (hico_20160224_det.tar.gz) into the data directory.

Instead of using the original annotation files, we use the annotation files provided by the PPDM authors. They can be downloaded from here. The downloaded annotation files have to be placed as follows.

qpic
 |─ data
 │   └─ hico_20160224_det
 |       |─ annotations
 |       |   |─ trainval_hico.json
 |       |   |─ test_hico.json
 |       |   └─ corre_hico.npy
 :       :

2. V-COCO

First, clone the V-COCO repository from here and follow its instructions to generate the file instances_vcoco_all_2014.json. Next, download the prior file prior.pickle from here. Place the files and create the directories as follows.

qpic
 |─ data
 │   └─ v-coco
 |       |─ data
 |       |   |─ instances_vcoco_all_2014.json
 |       |   :
 |       |─ prior.pickle
 |       |─ images
 |       |   |─ train2014
 |       |   |   |─ COCO_train2014_000000000009.jpg
 |       |   |   :
 |       |   └─ val2014
 |       |       |─ COCO_val2014_000000000042.jpg
 |       |       :
 |       |─ annotations
 :       :

The annotation file has to be converted to the HOIA format. The conversion can be conducted as follows.

PYTHONPATH=data/v-coco \
        python convert_vcoco_annotations.py \
        --load_path data/v-coco/data \
        --prior_path data/v-coco/prior.pickle \
        --save_path data/v-coco/annotations

Note that only Python 2 can be used for this conversion because vsrl_utils.py in the v-coco repository throws an error with Python 3.

The V-COCO annotations in the HOIA format (corre_vcoco.npy, test_vcoco.json and trainval_vcoco.json) will be generated in the annotations directory.

3. Open Images v6

Open Images v6 can be downloaded from this link. We transform the annotations to the HICO-DET format; they can be downloaded from the link provided above in Annotation preparation. The dataset should be organized as follows:

Open Images v6
 |
 |─ images
 |    |─ ca5267a6336b71ea.jpg
 |    :
 |
 └─ annotations

Relational Language-Image Pre-training

We provide a series of pre-trained weights for you to use. First, we provide weights after relational pre-training (RLIP weights) on VG+COCO+Objects365 using RLIPv2-ParSeDA.

Model Pre-training Paradigm Pre-training Dataset Backbone Download
RLIPv2-ParSeDA RLIP VG+COCO+O365 ResNet-50 Link
RLIPv2-ParSeDA RLIP VG+COCO+O365 Swin-T Link
RLIPv2-ParSeDA RLIP VG+COCO+O365 Swin-L Link

Second, we provide object detection weights (OD weights) trained on COCO and COCO+Objects365, which are used to initialize RLIPv2-ParSeDA before RLIP pre-training.

Model Pre-training Paradigm Pre-training Dataset Backbone Download
RLIPv2-ParSeDA OD COCO ResNet-50 Link
RLIPv2-ParSeDA OD COCO+O365 ResNet-50 Link
RLIPv2-ParSeDA OD COCO Swin-T Link
RLIPv2-ParSeDA OD COCO+O365 Swin-T Link
RLIPv2-ParSeDA OD COCO Swin-L Link
RLIPv2-ParSeDA OD COCO+O365 Swin-L Link
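
If loading a downloaded checkpoint reports missing or unexpected keys, it can help to inspect the file first. A minimal sketch, assuming a standard PyTorch checkpoint whose weights are stored under the "model" key (which is how the training code reads them):

# Minimal sketch: inspect a downloaded checkpoint before pre-training or fine-tuning.
# Assumes a standard torch checkpoint with the weights under the "model" key.
import torch

checkpoint = torch.load("/PATH/TO/CHECKPOINT.pth", map_location="cpu")
print("top-level keys:", list(checkpoint.keys()))
state_dict = checkpoint.get("model", checkpoint)
print("number of parameter tensors:", len(state_dict))
print("first few parameter names:", list(state_dict.keys())[:5])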

Note that all the scripts used for pre-training RLIPv2 are located under scripts/RLIP_ParSeDA. For instance, train_RLIP_ParSeDA_v2_mixed_vgcocoo365_swinL.sh is the script for pre-training on the mixed VG+COCO+O365 dataset with Swin-L and RLIPv2-ParSeDA.

Downstream tasks

We provide a series of checkpoints below, which can be used to reproduce the results in the paper. Note that all the scripts used for fine-tuning RLIPv2 are located under scripts/RLIP_ParSeDA. For instance, fine_tune_RLIP_ParSeDA_v2_hico_swinL_few-shot.sh is the script for few-shot transfer on HICO-DET with Swin-L and RLIPv2-ParSeDA.

Fully fine-tuning on HICO-DET

Model Backbone Rare / Non-Rare / Full Download
RLIPv2-ParSeDA ResNet-50 29.61 / 37.10 / 35.38 Link
RLIPv2-ParSeDA Swin-T 33.66 / 40.07 / 38.60 Link
RLIPv2-ParSeDA Swin-L 43.23 / 45.64 / 45.09 Link

Few-shot transfer on HICO-DET

Model Backbone Setting Rare / Non-Rare / Full Download
RLIPv2-ParSeDA ResNet-50 1% 22.13 / 24.51 / 23.96 Link
RLIPv2-ParSeDA ResNet-50 10% 23.28 / 30.02 / 28.46 Link
RLIPv2-ParSeDA Swin-T 1% 24.26 / 28.92 / 27.85 Link
RLIPv2-ParSeDA Swin-T 10% 28.31 / 32.93 / 31.87 Link
RLIPv2-ParSeDA Swin-L 1% 31.89 / 32.32 / 32.22 Link
RLIPv2-ParSeDA Swin-L 10% 34.75 / 38.27 / 37.46 Link

Zero-shot on HICO-DET

Model Backbone Setting Rare / Non-Rare / Full Download
RLIPv2-ParSeDA ResNet-50 NF 19.64 / 17.24 / 17.79 Link
RLIPv2-ParSeDA ResNet-50 UC-RF 21.45 / 35.85 / 32.97 Link
RLIPv2-ParSeDA ResNet-50 UC-NF 22.81 / 29.52 / 28.18 Link
RLIPv2-ParSeDA Swin-T NF 21.24 / 19.47 / 19.87 Link
RLIPv2-ParSeDA Swin-T UC-RF 26.95 / 39.92 / 37.32 Link
RLIPv2-ParSeDA Swin-T UC-NF 21.07 / 35.07 / 32.27 Link
RLIPv2-ParSeDA Swin-L NF 27.97 / 21.90 / 23.29 Link
RLIPv2-ParSeDA Swin-L UC-RF 31.23 / 45.01 / 42.26 Link
RLIPv2-ParSeDA Swin-L UC-NF 22.65 / 40.51 / 36.94 Link

Fully fine-tuning on V-COCO

Model Backbone AP (Scenario 1) / AP (Scenario 2) Download
RLIPv2-ParSeDA ResNet-50 65.9 / 68.0 Link
RLIPv2-ParSeDA Swin-T 68.8 / 70.8 Link
RLIPv2-ParSeDA Swin-L 72.1 / 74.1 Link

Fully fine-tuning on Open Images v6

Model Backbone R@50 / wmAP_rel / wmAP_phr / score_wtd Download
RLIPv2-ParSeDA ResNet-50 65.99 / 49.54 / 45.71 / 51.30 Link
RLIPv2-ParSeDA Swin-T 68.81 / 52.70 / 48.01 / 54.05 Link
RLIPv2-ParSeDA Swin-L 72.49 / 56.38 / 50.70 / 57.34 Link

Evaluation

The mAP on HICO-DET for the Full, Rare and Non-Rare sets is reported during the training process.

The results for the official evaluation of V-COCO must be obtained from a generated pickle file of detection results. Refer to test_vcoco_official.sh for details on running generate_vcoco_official.py.

Then, after modifying the paths, run the following commands to get the final performance:

cd /PATH/TO/RLIP
python datasets/vsrl_eval.py
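
Under the hood, vsrl_eval.py follows the standard v-coco evaluation API. A hedged sketch of the underlying call, assuming the script exposes the usual VCOCOeval class from the v-coco toolkit (the repo's script hard-codes its own paths, which is why they need to be edited before running):

# Hedged sketch of the standard v-coco evaluation call that vsrl_eval.py is based on.
# All paths below are placeholders.
from datasets.vsrl_eval import VCOCOeval

vcocoeval = VCOCOeval(
    "/PATH/TO/v-coco/data/vcoco/vcoco_test.json",           # v-coco action annotations
    "/PATH/TO/v-coco/data/instances_vcoco_all_2014.json",   # COCO-style annotations
    "/PATH/TO/v-coco/data/splits/vcoco_test.ids",           # test image ids
)
vcocoeval._do_eval("/PATH/TO/vcoco.pickle", ovr_thresh=0.5)  # pickle from generate_vcoco_official.py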

Relational pseudo-labelling

As detailed in Figure 3 of the main paper, relational pseudo-labelling involves (i) generating captions using a captioner (i.e., BLIP), (ii) generating a relation candidate set and (iii) assigning relation texts to region pairs via the R-Tagger.

  • Step 1: run RLIP_caption_coco.py under the BLIP-main folder to generate captions. The code is available here.
  • Step 2: run the transform_BLIP_sentences_to_triplets() function in datasets/rlipv2_helper/BLIP_coco_caption_helper.py to obtain scene graphs, and run the transform_BLIP_sngs_to_verb_tagger_input_format() function in the same file to obtain relation candidate sets for images (a parsing sketch follows this list).
  • Step 3: run the scripts/verb_tagger/test_Tagger_resnet.sh script (which runs generate_relations_using_verb_tagger.py) to assign relation texts to region pairs via the R-Tagger. The parameters and generated captions can be downloaded from this link.
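
Step 2 turns BLIP captions into relation triplets with a text scene-graph parser; the helper script imports the sng_parser package (pip install SceneGraphParser). A hedged sketch of that parsing step, without the additional filtering the helper applies:

# Hedged sketch: parse a BLIP-style caption into subject-relation-object triplets
# with sng_parser. The real helper applies extra post-processing not shown here.
import sng_parser

caption = "A man is riding a horse on the beach."
graph = sng_parser.parse(caption)

entities = [entity["head"] for entity in graph["entities"]]
for relation in graph["relations"]:
    print(entities[relation["subject"]], relation["relation"], entities[relation["object"]])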

Acknowledgement

Part of this work's implementation refers to several prior works, including RLIP, OCN, QPIC, CDN, DETR, DDETR, MDETR and GLIP.


rlipv2's Issues

question about Visual Genome dataset

Hi, I want to download the Visual Genome dataset, but I can't open the three Visual Genome links in the README. I can open this link: https://homes.cs.washington.edu/~ranjay/visualgenome/api.html. May I ask which version you used?
Thank you very much.

The link has expired!

Thanks for your great work!

The link you provided for downloading the VG dataset has expired. Please update it!

OI Eval and Training Logs

Hi,

Fantastic work and thanks for releasing the codebase!

I noticed that your evaluation on Open Images seems different from other approaches: you compute the mean per GT triplet instead of per predicate class, and you also skip computation for "unseen" triplets. I am not sure whether these modifications would introduce big differences when comparing with other methods. Would you mind explaining this?

I would also like to ask whether you have training logs that you could share, especially the mAP results for Open Images (object detection results per class and predicate recall per class).

Thanks.

OD weight for pre-training mismatch

When I use the shell script train_RLIP_ParSeDA_v2_mixed_vgcocoo365_swinL.sh to reproduce the pre-training process, I set --pretrained to the OD weight I downloaded, swin_large_cocoo365_bs64_lr141_drop_path0.5_dp0_mqs_lft_dab_deformable_detr_plus_iterative_bbox_refinement_36eps_converted.pth.
However, there are many missing keys. I wonder which checkpoint is the correct one for this script.

Links to the checkpoints expired

Hi! When I try to access the checkpoints in the README, SharePoint shows a message saying the link has expired. Could you update them? Thanks!

Problem with the V-COCO fully fine-tuned model

Thanks for your brilliant work.
I downloaded the V-COCO fully fine-tuned model from the link provided in the README. When I try to run inference with the command

python generate_vcoco_official.py \
        --param_path /PATH/TO/CHECKPOINT \
        --save_path vcoco.pickle \
        --hoi_path /PATH/TO/VCOCO/DATA

I get the following error:
Traceback (most recent call last):
File "generate_vcoco_official.py", line 594, in
main(args)
File "generate_vcoco_official.py", line 431, in main
load_info = model.load_state_dict(checkpoint['model'])
File "/XXX/anaconda3/envs/rlip/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1483, in load_state_dict
self.class.name, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DETRHOI:
Missing key(s) in state_dict: "transformer.encoder.layers.0.self_attn.in_proj_weight", "transformer.encoder.layers.0.self_attn.in_proj_bias", "transformer.encoder.layers.0.self_attn.out_proj.weight", "transformer.encoder.layers.0.self_attn.out_proj.bias", "transformer.encoder.layers.1.self_attn.in_proj_weight",
......
I tried both the ResNet-50 and Swin-T models, and they have the same problem. I wonder whether it is a problem with the downloaded checkpoints.
I sincerely hope you can help me figure out what the problem is.

Are there more pre-trained models?

Thanks for your excellent work!

I'm wondering whether more pre-trained models will be released, such as the R-Tagger based on Swin-L and the RLIP-ParSeD series.

Problem with the pre-trained model

Hi,
Thanks for your nice work! When I ran the code, I came across the following error.

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/data-nas/peizhi/params/resnet50-19c8e357.pth'

Also, I found that many model files are expected under the folder '/mnt/data-nas/peizhi/', but I don't know where to get those models. I also can't find them in the environment configs.
I would appreciate any reply you can give me. Thank you!

How to run inference with the HICO-DET fully fine-tuned model

Thanks for your help last time. I wonder how to run inference with the HICO-DET fully fine-tuned checkpoint and reproduce the result in the paper, since I haven't found a shell script for it yet, like test_vcoco_official.sh.

Can't import "sng_parser"

Thanks for your great work!

I am trying to run the pseudo-labelling job on my own data, so I ran the script RLIPv2/datasets/rlipv2_helper/BLIP_coco_caption_helper.py, but it raises the error "ModuleNotFoundError: No module named 'sng_parser'".

I looked for an "sng_parser.py" file in the whole project but couldn't find it.

Could you tell me how to fix this problem?

THANK YOU VERY MUCH!!!

Download link for the generated captions points to the wrong page

It appears that the download link for the generated captions mistakenly points to the download page for the zero-shot (HICO) checkpoints.

Relational pseudo-labelling

  • Step 3: run scripts/verb_tagger/test_Tagger_resnet.sh script (which runs generate_relations_using_verb_tagger.py) to assign relation texts to region pairs via R-Tagger. The parameters and generated captions can be downloaded from this link.

Could you kindly check this for me and rectify the issue?
