
Recent Advances in Vision-and-Language Pre-training (VLP)

Maintained by Feilong Chen. Last update on 2023/03/04.

Table of Contents

Survey

  1. VLP: A Survey on Vision-Language Pre-training, arXiv 2022

Image-based VLP

Representation Learning

  1. Learning Transferable Visual Models From Natural Language Supervision, CLIP, ICML 2021, [code]

  2. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019 [code]

  3. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019 [code]

  4. VL-BERT: Pre-training of Generic Visual-Linguistic Representations, ICLR 2020 [code]

  5. VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019/08, ACL 2020 [code]

  6. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, AAAI 2020

  7. Unified Vision-Language Pre-Training for Image Captioning and VQA, AAAI 2020, [code], (VLP)

  8. UNITER: Learning Universal Image-text Representations, ECCV 2020, [code]

  9. Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks, arXiv 2019/12

  10. InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining, arXiv 2020/03

  11. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV 2020, [code]

  12. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv 2020/04

  13. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph, arXiv 2020/06

  14. DeVLBert: Learning Deconfounded Visio-Linguistic Representations, ACM MM 2020, [code]

  15. X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers, EMNLP 2020

  16. SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels, ICLR 2021 submission

  17. CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations, arXiv 2020/10

  18. Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs, arXiv 2020/11

  19. LAMP: Label Augmented Multimodal Pretraining, arXiv 2020/12

  20. Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, AAAI 2021

  21. VinVL: Revisiting Visual Representations in Vision-Language Models, CVPR 2021, [code]

  22. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, ICML 2021, [code]

  23. OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation, arXiv 2021

  24. UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning, ACL 2021, [code]

  25. How Much Can CLIP Benefit Vision-and-Language Tasks?, arXiv 2021, [code]

  26. Unifying Vision-and-Language Tasks via Text Generation, ICML 2021, [code]

  27. Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs, ACL 2021, [code]

  28. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, arXiv 2021

  29. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, arXiv 2021, [code]

  30. Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, CVPR 2021, [code]

  31. Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, ICML 2022, [code]

  32. Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022, [code]

  33. Unpaired Vision-Language Pre-training via Cross-Modal CutMix, ICML 2022

  34. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, ICML 2022, [code]

  35. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, ICML 2022, [code]

  36. GIT: A Generative Image-to-text Transformer for Vision and Language, arXiv 2022, [code]

  37. CoCa: Contrastive Captioners are Image-Text Foundation Models, arXiv 2022, [code]

  38. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, arXiv 2022, [code]

  39. PaLI: A Jointly-Scaled Multilingual Language-Image Model, arXiv 2022

  40. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, arXiv 2023

  41. Language Is Not All You Need: Aligning Perception with Language Models, arXiv 2023, [code]

  42. Unifying Vision-Language Representation Space with Single-tower Transformer, AAAI 2023

Task-specific

Image Caption

  1. Image captioning: XGPT: Cross-modal Generative Pre-Training for Image Captioning, arXiv 2020/03

VQA

  1. VQA: Fusion of Detected Objects in Text for Visual Question Answering, EMNLP 2019, [code], (B2T2)

  2. TextVQA: Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA, CVPR 2020, [code], (M4C)

  3. Chart VQA: STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering, EMNLP 2020.

  4. Visual Question Generation: BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations, arXiv 2020/02

  5. TextVQA: TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation, arXiv 2022, [code], (TAG)

Visual Dialog

  1. VisDial: VD-BERT: A Unified Vision and Dialog Transformer with BERT, EMNLP 2020 [code], (VD-BERT)

  2. VisDial: Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline, ECCV 2020 [code], (VisDial-BERT)

  3. VisDial: UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog, CVPR 2022

Text-Image Retrieval

  1. Text-image retrieval: ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data, arXiv 2020/01

  2. Text-image retrieval: Cross-Probe BERT for Efficient and Effective Cross-Modal Search, ICLR 2021 submission.

  3. Text-image retrieval: Learning Relation Alignment for Calibrated Cross-modal Retrieval, ACL 2021.

  4. Text-image retrieval: Dynamic Contrastive Distillation for Image-Text Retrieval, arXiv 2022/07.

  5. Text-image retrieval: Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval, SIGIR 2022.

Visual Language Navigation

  1. VLN: Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training, CVPR 2020, [code], (PREVALENT)

Visual Machine Reading Comprehension

  1. VisualMRC: VisualMRC: Machine Reading Comprehension on Document Images, AAAI 2021, (LayoutT5, LayoutBART)

Other Tasks

  1. Visual Relationship Detection: Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations, IEEE Access 2021

Other Analysis

  1. Multi-task Learning, 12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020, [code]

  2. Multi-task Learning, Unifying Vision-and-Language Tasks via Text Generation, arXiv 2021/02

  3. Social Bias in VL Embedding, Measuring Social Biases in Grounded Vision and Language Embeddings, arXiv 2020/02, [code]

  4. In-depth Analysis, Are we pretraining it right? Digging deeper into visio-linguistic pretraining

  5. In-depth Analysis, Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models, ECCV 2020 Spotlight

  6. In-depth Analysis, A Closer Look at the Robustness of Vision-and-Language Pre-trained Models, arXiv 2020/12

  7. Adversarial Training, Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020 Spotlight

  8. Adaptive Analysis, Adaptive Transformers for Learning Multimodal Representations, ACL SRW 2020

  9. Neural Architecture Search, Deep Multimodal Neural Architecture Search, arXiv 2020/04

  10. Dataset perspective, Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, arXiv 2021/02

Video-based VLP

  1. VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019

  2. Learning Video Representations Using Contrastive Bidirectional Transformers, arXiv 2019/06, (CBT)

  3. M-BERT: Injecting Multimodal Information in the BERT Structure, arXiv 2019/08

  4. BERT for Large-scale Video Segment Classification with Test-time Augmentation, ICCV 2019 YouTube8M workshop, [code]

  5. Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog, AAAI 2020 DSTC8 workshop

  6. Learning Spatiotemporal Features via Video and Text Pair Discrimination, arXiv 2020/01, (CPD), [code]

  7. UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, arXiv 2020/02

  8. ActBERT: Learning Global-Local Video-Text Representations, CVPR 2020

  9. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training, EMNLP 2020

  10. Video-Grounded Dialogues with Pretrained Generation Language Models, ACL 2020

  11. Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training, arXiv 2020/07

  12. Multimodal Pretraining for Dense Video Captioning, arXiv 2020/11

  13. Parameter Efficient Multimodal Transformers for Video Representation Learning, arXiv 2020/12

  14. Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling, CVPR 2021

Other Transformer-based multimodal networks

  1. Multi-Modality Cross Attention Network for Image and Sentence Matching, CVPR 2020

  2. MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning, ACL 2020

  3. History for Visual Dialog: Do we really need it?, ACL 2020

  4. Cross-Modality Relevance for Reasoning on Language and Vision, ACL 2020

Other Resources
