I have reorganized the list in chronological order and added some new papers.
The original repo is maintained by WANG Yue ([email protected]). Last updated: 2021/06/14.
Update log:
2021-10-29: added CVPR'21, ICCV'21, ACL'21, ICML'21 papers
Other Transformer-based multimodal networks
[2021.08] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
SOTA on VQA
[2021.07] How Much Can CLIP Benefit Vision-and-Language Tasks? (ICLR 2022 submission)
[2021.05] SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels
[2021.05] Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
Medical image processing
[2020.12] LAMP: Label Augmented Multimodal Pretraining
[2020.12] A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
In-depth Analysis
[2020.11] Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs
[2020.10] CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations
[2020.04] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
[2020.03] InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
[ICML] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision [code]
[ICML] Unifying Vision-and-Language Tasks via Text Generation
Multi-task Learning
[ICML] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Dataset perspective
[NeurIPS] Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [code]
[ICCV] COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation [code]
[ICCV] Airbert: In-Domain Pretraining for Vision-and-Language Navigation [code]
[CVPR] Seeing Out of the Box: End-to-End Pre-Training for Vision-Language Representation Learning
[CVPR] UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training
[CVPR] Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Dataset
[CVPR] Multimodal Contrastive Training for Visual Representation Learning
[CVPR] Kaleido-BERT: Vision-Language Pre-Training on Fashion Domain
[CVPR] Causal Attention for Vision-Language Tasks [code]
[CVPR] M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-Training [code]
[CVPR] VinVL: Revisiting Visual Representations in Vision-Language Models [detection code] [pretraining code]
[ACL] UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning [code]
[ACL] E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
[AAAI] Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
[AAAI] ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
[NeurIPS] Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Adversarial Training
[ICLR] VL-BERT: Pre-training of Generic Visual-Linguistic Representations [code]
[CVPR] 12-in-1: Multi-Task Vision and Language Representation Learning [code]
Multi-task Learning
[ACL] VisualBERT: A Simple and Performant Baseline for Vision and Language [code]
[AAAI] Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
[AAAI] Unified Vision-Language Pre-Training for Image Captioning and VQA [code] (VLP)
[ECCV] Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
In-depth Analysis
[ECCV] UNITER: Learning Universal Image-text Representations [code]
[ECAI] Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks
[ECCV] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks [code]
[ACMMM] DeVLBert: Learning Deconfounded Visio-Linguistic Representations [code]
[EMNLP] X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers [code]
[NeurIPS] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks [code]
[EMNLP] LXMERT: Learning Cross-Modality Encoder Representations from Transformers [code]
[ICLR 2021 submission] Cross-Probe BERT for Efficient and Effective Cross-Modal Search
Task-specific models
[2020.02] What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Visual Question Generation
[2020.01] ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data
Text-image retrieval
[CVPR] A Recurrent Vision-and-Language BERT for Navigation
Vision-language navigation
[AAAI] VisualMRC: Machine Reading Comprehension on Document Images, (LayoutT5, LayoutBART)
Machine Reading Comprehension
[IEEE Access] Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations
Visual Relationship Detection
[NLPCC] XGPT: Cross-modal Generative Pre-Training for Image Captioning
Image captioning
[CVPR] Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA [code], (M4C)
TextVQA
[CVPR] Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [code], (PREVALENT)
Vision-Language Navigation
[ECCV] Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline [code], (VisDial-BERT)
Visual Dialog
[EMNLP] VD-BERT: A Unified Vision and Dialog Transformer with BERT [code]
Visual Dialog
[EMNLP] STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering
Chart VQA
[EMNLP] Fusion of Detected Objects in Text for Visual Question Answering [code], (B2T2)
Visual Question Answering
Other analysis
[ICCVW 2021] Are we pretraining it right? Digging deeper into visio-linguistic pretraining
In-depth Analysis
[ACL SRW 2020] Adaptive Transformers for Learning Multimodal Representations
Adaptive Analysis
[arXiv 2020.04] Deep Multimodal Neural Architecture Search
Neural Architecture Search
[arXiv 2020.02] Measuring Social Biases in Grounded Vision and Language Embeddings [code]
Social Bias in VL Embedding
Video-language models
[2020.07] Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
[2020.02] UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [code]
[2020.01] Learning Spatiotemporal Features via Video and Text Pair Discrimination [code] (CPD)
[2019.08] M-BERT: Injecting Multimodal Information in the BERT Structure
[2019.06] Learning Video Representations Using Contrastive Bidirectional Transformers [ICLR 2020 submission] (CBT)
[ACL Findings] VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
[CVPR] Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
[ICLR] Parameter Efficient Multimodal Transformers for Video Representation Learning
[CVPR] ActBERT: Learning Global-Local Video-Text Representations
[EMNLP] HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
[ACL] Video-Grounded Dialogues with Pretrained Generation Language Models
[AACL] Multimodal Pretraining for Dense Video Captioning
[AAAIW] Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog
[ICCV] VideoBERT: A Joint Model for Video and Language Representation Learning
[ICCVW] BERT for Large-scale Video Segment Classification with Test-time Augmentation [code]
Speech pretraining models
[arXiv 2019.11] Effectiveness of self-supervised pre-training for speech recognition
[arXiv 2019.10] SpeechBERT: Cross-Modal Pre-trained Language Model for End-to-end Spoken Question Answering
[arXiv 2019.10] vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
[arXiv 2019.09] Understanding Semantics from Speech Through Pre-training
[arXiv 2019.06] Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models
Two recent surveys on pretrained language models
- Pre-trained Models for Natural Language Processing: A Survey, arXiv 2020/03
- A Survey on Contextual Embeddings, arXiv 2020/03
Other surveys about multimodal research
- Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods, JAIR 2021
- Deep Multimodal Representation Learning: A Survey, arXiv 2019
- Multimodal Machine Learning: A Survey and Taxonomy, TPAMI 2018
- A Comprehensive Survey of Deep Learning for Image Captioning, ACM Computing Surveys 2018
Other repositories of relevant reading lists