IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training (ACM MM2022)

Introduction

Official PyTorch Implementation of the paper IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training

Xinyu Huang, Youcai Zhang, Ying Cheng, Weiwei Tian, Ruiwei Zhao, Rui Feng*, Yuejie Zhang*, Yaqian Li, Yandong Guo, Xiaobo Zhang*
Fudan University, OPPO Research Institute, Shanghai Key Laboratory of Intelligent Information Processing

Abstract

Training vision-language models on image-text pairs that co-occur on the Internet is suboptimal, as such supervision typically lacks explicit alignment information. We propose IDEA to provide more explicit supervision (including multiple valuable tags and texts composed of multiple tags). IDEA jointly trains multi-label recognition with tags from texts and identifies additional tags online.

Example results

Text refers to the original text that co-occurs with the image. Tags refer to the tags identified by IDEA, including objects, scenes, attributes, actions, etc. These image tags are learned entirely from the texts and recognized online.

Credit to previous work

This repository is built upon the amazing code bases of ALBEF, CLIP and ML-decoder. Thanks very much!

Dataset

IDEA jointly trains multi-label recognition with tags extracted from the text according to a tag list. We provide an example training JSON file for the VG and COCO datasets in configs/vg_coco.json; the tag list is in dataset/class_config.py. Before training, change the data path in configs/vg_coco.json to the location of the files on your machine.
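As a rough illustration of extracting tags from text according to a tag list, the sketch below matches each tag against the caption as a whole word or phrase. It is not the extraction code used by this repository: the function name `extract_tags` and the small `EXAMPLE_TAG_LIST` are hypothetical, and the matching is deliberately simple (no plural or synonym handling).

```python
import re

# Hypothetical tag list for illustration only; the real list is in dataset/class_config.py.
EXAMPLE_TAG_LIST = ["dog", "grass", "frisbee", "running", "traffic light"]


def extract_tags(caption, tag_list):
    """Return the tags from tag_list that occur in the caption as whole words or phrases."""
    text = caption.lower()
    found = []
    for tag in tag_list:
        # \b keeps "dog" from matching inside "dogma"; re.escape guards special characters.
        pattern = r"\b" + re.escape(tag.lower()) + r"\b"
        if re.search(pattern, text):
            found.append(tag)
    return found
```

For example, `extract_tags("A dog is running on the grass.", EXAMPLE_TAG_LIST)` yields the tags in tag-list order. A caption containing "dogs" would not match "dog" here, so a real pipeline would likely add lemmatization or a synonym map.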

You can also build your own tag list and use other extraction methods. To use a custom dataset, prepare one or more JSON files, each containing a list. Each item in the list is a dictionary with three key-value pairs: {'image': path_of_image, 'caption': text_of_image, 'tag': tags_from_text}. Set the paths to these JSON files in configs/Pretrain.yaml.
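A minimal sketch of writing a custom dataset file in the format described above might look like the following. The helper names (`build_item`, `validate_items`, `write_dataset`) and the example paths are hypothetical. The sketch also assumes that the 'tag' value is a list of strings, which the description above does not specify, so check the data loader before relying on it.

```python
import json

REQUIRED_KEYS = ("image", "caption", "tag")


def build_item(image_path, caption, tags):
    """Build one dataset entry with the three keys described above."""
    # Assumption: tags are stored as a list of strings; adjust if the loader expects another type.
    return {"image": image_path, "caption": caption, "tag": list(tags)}


def validate_items(items):
    """Raise ValueError if any entry is missing a required key."""
    for index, item in enumerate(items):
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            raise ValueError(f"item {index} is missing keys: {missing}")


def write_dataset(items, json_path):
    """Validate the entries and write them as a single JSON list."""
    validate_items(items)
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(items, handle, ensure_ascii=False, indent=2)
```

Usage could be as simple as `write_dataset([build_item("images/0001.jpg", "A dog runs on the grass.", ["dog", "grass"])], "my_dataset.json")`, after which the resulting path is listed in configs/Pretrain.yaml.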

Training Code

Pre-train the IDEA model using 8 V100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env Pretrain.py --config ./configs/Pretrain.yaml --output_dir output/idea/ --model idea

Pre-train the CLIP model using 8 V100 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --use_env Pretrain.py --config ./configs/Pretrain.yaml --output_dir output/clip/ --model clip

Zero-Shot Evaluation

python evaluation.py --checkpoint_path {path to model for evaluation} --dataset {path to ImageNet validation set}
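For readers unfamiliar with zero-shot evaluation, the general idea in CLIP-style models is to embed each class name as text, embed the image, and predict the class whose text embedding has the highest cosine similarity to the image embedding. The sketch below illustrates that idea with plain Python lists standing in for embeddings. It is not the code in evaluation.py, and all function names here are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(value * value for value in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [value / norm for value in vector]


def zero_shot_predict(image_embedding, class_text_embeddings):
    """Return the class name whose text embedding is most similar to the image embedding."""
    image_unit = normalize(image_embedding)
    best_name, best_score = None, -math.inf
    for name, text_embedding in class_text_embeddings.items():
        text_unit = normalize(text_embedding)
        # Dot product of unit vectors is the cosine similarity.
        score = sum(a * b for a, b in zip(image_unit, text_unit))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


def top1_accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(1 for predicted, label in zip(predictions, labels) if predicted == label)
    return correct / len(labels)
```

In a real evaluation the embeddings would come from the trained image and text encoders, and the accuracy would be averaged over the whole validation set.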

Citation

If you find this repository useful for your research, please consider citing:

@inproceedings{huang2022idea,
  title={IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training},
  author={Huang, Xinyu and Zhang, Youcai and Cheng, Ying and Tian, Weiwei and Zhao, Ruiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Xiaobo},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  pages={4573--4583},
  year={2022}
}
