Recent Advances in Vision and Language PreTrained Models (VL-PTMs)

I have reorganized the list in chronological order and added some new papers.

The original repo is maintained by WANG Yue ([email protected]); it was last updated on 2021/06/14.

Update log:

2021-10-29: add CVPR'21, ICCV'21, ACL'21, ICML'21 papers

Table of Contents

Image-based VL-PTMs

Video-based VL-PTMs

Speech-based VL-PTMs

Other Transformer-based multimodal networks

Other Resources

Image-based VL-PTMs

Representation Learning

arXiv

[2021.08] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
SOTA on VQA

[2021.05] SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels

[2021.05] Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
Medical image processing

[2020.12] LAMP: Label Augmented Multimodal Pretraining

[2020.12] A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
In-depth Analysis

[2020.11] Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs

[2020.10] CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations

[2021.07] How Much Can CLIP Benefit Vision-and-Language Tasks? ICLR 2022 submission

[2020.04] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

[2020.03] InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

2021

[ICML] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision [code]

[ICML] Unifying Vision-and-Language Tasks via Text Generation
Multi-task Learning

[ICML] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Dataset perspective

[NIPS] Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [code]

[ICCV] COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation [code]

[ICCV] Airbert: In-Domain Pretraining for Vision-and-Language Navigation [code]

[CVPR] Seeing Out of the Box: End-to-End Pre-Training for Vision-Language Representation Learning

[CVPR] UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training

[CVPR] Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Dataset

[CVPR] Multimodal Contrastive Training for Visual Representation Learning

[CVPR] Kaleido-BERT: Vision-Language Pre-Training on Fashion Domain

[CVPR] Causal Attention for Vision-Language Tasks [code]

[CVPR] M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-Training [code]

[CVPR] VinVL: Revisiting Visual Representations in Vision-Language Models [detection code] [pretraining code]

[ACL] UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning [code]

[ACL] E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning

[AAAI] Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network

[AAAI] ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph

2020

[NIPS] Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Adversarial Training

[ICLR] VL-BERT: Pre-training of Generic Visual-Linguistic Representations [code]

[CVPR] 12-in-1: Multi-Task Vision and Language Representation Learning [code]
Multi-task Learning

[ACL] VisualBERT: A Simple and Performant Baseline for Vision and Language [code]

[AAAI] Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

[AAAI] Unified Vision-Language Pre-Training for Image Captioning and VQA [code] (VLP)

[ECCV] Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
In-depth Analysis

[ECCV] UNITER: Learning Universal Image-text Representations [code]

[ECAI] Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks

[ECCV] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks [code]

[ACMMM] DeVLBert: Learning Deconfounded Visio-Linguistic Representations [code]

[EMNLP] X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers [code]

2019

[NIPS] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks [code]

[EMNLP] LXMERT: Learning Cross-Modality Encoder Representations from Transformers [code]
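
Several entries above (e.g., UNIMO, Align before Fuse, and the noisy-text scaling work) train with an image-text contrastive objective that pulls matched pairs together within a batch. Below is a minimal, illustrative sketch of such a symmetric InfoNCE-style loss, assuming generic image and text encoders; the function name, dimensions, and temperature are placeholders, not any listed paper's actual implementation.

```python
# Illustrative image-text contrastive (InfoNCE-style) loss.
# All names, shapes, and the temperature value are placeholders.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs."""
    # L2-normalize so the dot product becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarities: (batch, batch); the diagonal holds the positives.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each image should retrieve its own caption, and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random features standing in for encoder outputs.
image_emb = torch.randn(8, 256)  # hypothetical image-encoder outputs
text_emb = torch.randn(8, 256)   # hypothetical text-encoder outputs
print(image_text_contrastive_loss(image_emb, text_emb).item())
```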

Task-specific

arXiv

[ICLR 2021 submission] Cross-Probe BERT for Efficient and Effective Cross-Modal Search

[2020.02] What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Visual Question Generation

[2020.01] ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data
Text-image retrieval

2021

[CVPR] A Recurrent Vision-and-Language BERT for Navigation
Vision-language navigation

[AAAI] VisualMRC: Machine Reading Comprehension on Document Images (LayoutT5, LayoutBART)
Machine Reading Comprehension

[IEEE Access] Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations
Visual Relationship Detection

[NLPCC] XGPT: Cross-modal Generative Pre-Training for Image Captioning
Image captioning

2020

[CVPR] Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA [code], (M4C)
TextVQA

[CVPR] Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [code], (PREVALENT)
Vision-Language Navigation

[ECCV] Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline [code], (VisDial-BERT)
Visual Dialog

[EMNLP] VD-BERT: A Unified Vision and Dialog Transformer with BERT [code], (VD-BERT)
Visual Dialog

[EMNLP] STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering
Chart VQA

2019

[EMNLP] Fusion of Detected Objects in Text for Visual Question Answering [code], (B2T2)
Visual Question Answering

Other Analysis

[ICCVW 2021] Are we pretraining it right? Digging deeper into visio-linguistic pretraining
In-depth Analysis

[ACL SRW 2020] Adaptive Transformers for Learning Multimodal Representations
Adaptive Analysis

[arXiv 2020.04] Deep Multimodal Neural Architecture Search
Neural Architecture Search

[arXiv 2020.02] Measuring Social Biases in Grounded Vision and Language Embeddings [code]
Social Bias in VL Embedding
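
Complementing the contrastive sketch above, many of the BERT-style models in the Representation Learning list (the VisualBERT/UNITER/VL-BERT family) follow a single-stream recipe: detector region features and word embeddings are concatenated into one sequence for a standard Transformer, trained with masked language modeling and an image-text matching head. The sketch below only illustrates that input/head layout; module names, dimensions, and head choices are made-up placeholders rather than any listed paper's architecture.

```python
# Toy single-stream vision-language encoder: region features + word embeddings
# in one Transformer, with MLM and image-text-matching heads.
# All names, sizes, and layer counts are illustrative placeholders.
import torch
import torch.nn as nn

class ToyVLEncoder(nn.Module):
    def __init__(self, vocab_size=1000, region_dim=2048, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)      # project detector features
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(hidden, vocab_size)          # masked-token prediction
        self.itm_head = nn.Linear(hidden, 2)                   # image-text matched / not

    def forward(self, token_ids, region_feats):
        tokens = self.word_emb(token_ids)                      # (B, T, H)
        regions = self.region_proj(region_feats)               # (B, R, H)
        x = torch.cat([tokens, regions], dim=1)                # single-stream sequence
        x = self.encoder(x)
        mlm_logits = self.mlm_head(x[:, :token_ids.size(1)])   # over text positions
        itm_logits = self.itm_head(x[:, 0])                    # first token as a [CLS] proxy
        return mlm_logits, itm_logits

# Toy usage with random region features and token ids.
model = ToyVLEncoder()
token_ids = torch.randint(0, 1000, (2, 12))
region_feats = torch.randn(2, 36, 2048)
mlm_logits, itm_logits = model(token_ids, region_feats)
print(mlm_logits.shape, itm_logits.shape)  # (2, 12, 1000) and (2, 2)
```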

Video-based VL-PTMs

arXiv

[2020.07] Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training

[2020.02] UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [code]

[2020.01] Learning Spatiotemporal Features via Video and Text Pair Discrimination [code] (CPD)

[2019.08] M-BERT: Injecting Multimodal Information in the BERT Structure

[2019.06] Learning Video Representations Using Contrastive Bidirectional Transformers [ICLR 2020 submission] (CBT)

2021

[ACL Findings] VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

[CVPR] Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

[ICLR] Parameter Efficient Multimodal Transformers for Video Representation Learning

2020

[CVPR] ActBERT: Learning Global-Local Video-Text Representations

[EMNLP] HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

[ACL] Video-Grounded Dialogues with Pretrained Generation Language Models

[AACL] Multimodal Pretraining for Dense Video Captioning

[AAAIW] Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog

2019

[ICCV] VideoBERT: A Joint Model for Video and Language Representation Learning

[ICCVW] BERT for Large-scale Video Segment Classification with Test-time Augmentation [code]
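
A recurring practical point in the video-language papers above, made explicit by ClipBERT's sparse sampling, is that a clip can be represented by a handful of sampled frames rather than every frame. The sketch below illustrates only that generic idea (uniform frame sampling plus simple clip-level pooling); the function names, shapes, and stand-in features are assumptions for illustration, not any paper's pipeline.

```python
# Illustrative sparse frame sampling for video-text models.
# Shapes and the pooling choice are placeholders for the general idea only.
import torch

def sample_frames(video, num_samples=4):
    """Uniformly pick a few frames from a (T, C, H, W) clip instead of using all T."""
    t = video.size(0)
    idx = torch.linspace(0, t - 1, steps=num_samples).long()
    return video[idx]

def pool_frame_features(frame_feats):
    """Mean-pool per-frame features into one clip-level representation."""
    return frame_feats.mean(dim=0)

# Toy usage: a 64-frame clip; random features stand in for a per-frame encoder.
video = torch.randn(64, 3, 224, 224)
frames = sample_frames(video, num_samples=4)   # (4, 3, 224, 224)
frame_feats = torch.randn(4, 256)              # hypothetical per-frame encoder outputs
clip_feat = pool_frame_features(frame_feats)   # (256,)
print(frames.shape, clip_feat.shape)
```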

Speech-based VL-PTMs

[arXiv 2019.11] Effectiveness of self-supervised pre-training for speech recognition

[arXiv 2019.10] SpeechBERT: Cross-Modal Pre-trained Language Model for End-to-end Spoken Question Answering

[arXiv 2019.10] Vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

[arXiv 2019.09] Understanding Semantics from Speech Through Pre-training

[arXiv 2019.06] Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

Other Resources

Two recent surveys on pretrained language models

Other surveys about multimodal research

Other repositories with relevant reading lists
