I have reorganized the list in chronological order and added some new papers.
The original repo is maintained by WANG Yue ([email protected]). Last updated: 2021/06/14.
Update log:
2021-10-29: added CVPR'21, ICCV'21, ACL'21, ICML'21 papers
Other Transformer-based multimodal networks
[2021.08] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
SOTA on VQA
[2021.07] How Much Can CLIP Benefit Vision-and-Language Tasks? (ICLR 2022 submission)
[2021.05] SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels
[2021.05] Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
Medical image processing
[2020.12] LAMP: Label Augmented Multimodal Pretraining
[2020.12] A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
In-depth Analysis
[2020.11] Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs
[2020.10] CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations
[2020.04] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
[2020.03] InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
[ICML] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision [code]
[ICML] Unifying Vision-and-Language Tasks via Text Generation
Multi-task Learning
[ICML] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Dataset perspective
[NeurIPS] Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [code]
[ICCV] COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation [code]
[ICCV] Airbert: In-Domain Pretraining for Vision-and-Language Navigation [code]
[CVPR] Seeing Out of the Box: End-to-End Pre-Training for Vision-Language Representation Learning
[CVPR] UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training
[CVPR] Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Dataset
[CVPR] Multimodal Contrastive Training for Visual Representation Learning
[CVPR] Kaleido-BERT: Vision-Language Pre-Training on Fashion Domain
[CVPR] Causal Attention for Vision-Language Tasks [code]
[CVPR] M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-Training [code]
[CVPR] VinVL: Revisiting Visual Representations in Vision-Language Models [detection code] [pretraining code]
[ACL] UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning [code]
[ACL] E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
[AAAI] Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
[AAAI] ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
[NeurIPS] Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Adversarial Training
[ICLR] VL-BERT: Pre-training of Generic Visual-Linguistic Representations [code]
[CVPR] 12-in-1: Multi-Task Vision and Language Representation Learning [code]
Multi-task Learning
[ACL] VisualBERT: A Simple and Performant Baseline for Vision and Language [code]
[AAAI] Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
[AAAI] Unified Vision-Language Pre-Training for Image Captioning and VQA [code] (VLP)
[ECCV] Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
In-depth Analysis
[ECCV] UNITER: Learning Universal Image-text Representations [code]
[ECAI] Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks
[ECCV] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks [code]
[ACMMM] DeVLBert: Learning Deconfounded Visio-Linguistic Representations [code]
[EMNLP] X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers [code]
[NeurIPS] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks [code]
[EMNLP] LXMERT: Learning Cross-Modality Encoder Representations from Transformers [code]
[ICLR 2021 submission] Cross-Probe BERT for Efficient and Effective Cross-Modal Search
Task-specific models
[2020.02] What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Visual Question Generation
[2020.01] ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data
Text-image retrieval
[CVPR] A Recurrent Vision-and-Language BERT for Navigation
Vision-language navigation
[AAAI] VisualMRC: Machine Reading Comprehension on Document Images, (LayoutT5, LayoutBART)
Machine Reading Comprehension
[IEEE Access] Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations
Visual Relationship Detection
[NLPCC] XGPT: Cross-modal Generative Pre-Training for Image Captioning
Image captioning
[CVPR] Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA [code], (M4C)
TextVQA
[CVPR] Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [code], (PREVALENT)
Vision-Language Navigation
[ECCV] Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline [code], (VisDial-BERT)
Visual Dialog
[EMNLP] VD-BERT: A Unified Vision and Dialog Transformer with BERT [code]
Visual Dialog
[EMNLP] STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering
Chart VQA
[EMNLP] Fusion of Detected Objects in Text for Visual Question Answering [code], (B2T2)
Visual Question Answering
Other analysis
[ICCVW 2021] Are we pretraining it right? Digging deeper into visio-linguistic pretraining
In-depth Analysis
[ACL SRW 2020] Adaptive Transformers for Learning Multimodal Representations
Adaptive Analysis
[arXiv 2020.04] Deep Multimodal Neural Architecture Search
Neural Architecture Search
[arXiv 2020.02] Measuring Social Biases in Grounded Vision and Language Embeddings [code]
Social Bias in VL Embedding
Video-language models
[2020.07] Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
[2020.02] UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [code]
[2020.01] Learning Spatiotemporal Features via Video and Text Pair Discrimination [code] (CPD)
[2019.08] M-BERT: Injecting Multimodal Information in the BERT Structure
[2019.06] Learning Video Representations Using Contrastive Bidirectional Transformers [ICLR 2020 submission] (CBT)
[ACL Findings] VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
[CVPR] Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
[ICLR] Parameter Efficient Multimodal Transformers for Video Representation Learning
[CVPR] ActBERT: Learning Global-Local Video-Text Representations
[EMNLP] HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
[ACL] Video-Grounded Dialogues with Pretrained Generation Language Models
[AACL] Multimodal Pretraining for Dense Video Captioning
[AAAIW] Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog
[ICCV] VideoBERT: A Joint Model for Video and Language Representation Learning
[ICCVW] BERT for Large-scale Video Segment Classification with Test-time Augmentation [code]
Speech pretraining models
[arXiv 2019.11] Effectiveness of self-supervised pre-training for speech recognition
[arXiv 2019.10] SpeechBERT: Cross-Modal Pre-trained Language Model for End-to-end Spoken Question Answering
[arXiv 2019.10] vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
[arXiv 2019.09] Understanding Semantics from Speech Through Pre-training
[arXiv 2019.06] Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models
Two recent surveys on pretrained language models
- Pre-trained Models for Natural Language Processing: A Survey, arXiv 2020/03
- A Survey on Contextual Embeddings, arXiv 2020/03
Other surveys about multimodal research
- Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods, JAIR 2021
- Deep Multimodal Representation Learning: A Survey, arXiv 2019
- Multimodal Machine Learning: A Survey and Taxonomy, TPAMI 2018
- A Comprehensive Survey of Deep Learning for Image Captioning, ACM Computing Surveys 2018
Other repositories of relevant reading lists