
Recent Advances in Vision-and-Language Pre-training (VLP)

Maintained by Feilong Chen. Last update on 2023/03/04.

Table of Contents

Survey

  1. VLP: A Survey on Vision-Language Pre-training, arXiv 2022

Image-based VLP

Representation Learning

  1. Learning Transferable Visual Models From Natural Language Supervision, CLIP, ICML 2021, [code]

  2. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019 [code]

  3. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019 [code]

  4. VL-BERT: Pre-training of Generic Visual-Linguistic Representations, ICLR 2020 [code]

  5. VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019/08, ACL 2020 [code]

  6. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, AAAI 2020

  7. Unified Vision-Language Pre-Training for Image Captioning and VQA, AAAI 2020, [code], (VLP)

  8. UNITER: Learning Universal Image-text Representations, ECCV 2020, [code]

  9. Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks, arXiv 2019/12

  10. InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining, arXiv 2020/03

  11. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV 2020, [code]

  12. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv 2020/04

  13. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph, arXiv 2020/06

  14. DeVLBert: Learning Deconfounded Visio-Linguistic Representations, ACM MM 2020, [code]

  15. X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers, EMNLP 2020

  16. SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels, ICLR 2021 submission

  17. CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations, arXiv 2020/10

  18. Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs, arXiv 2020/11

  19. LAMP: Label Augmented Multimodal Pretraining, arXiv 2020/12

  20. Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, AAAI 2021

  21. VinVL: Revisiting Visual Representations in Vision-Language Models, CVPR 2021, [code]

  22. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, ICML 2021, [code]

  23. OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation, arXiv 2021

  24. UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning, ACL 2021, [code]

  25. How Much Can CLIP Benefit Vision-and-Language Tasks?, arXiv 2021, [code]

  26. Unifying Vision-and-Language Tasks via Text Generation, ICML 2021, [code]

  27. Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs, ACL 2021, [code]

  28. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, arXiv 2021

  29. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, arXiv 2021, [code]

  30. Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, CVPR 2021, [code]

  31. Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, ICML 2022, [code]

  32. Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022, [code]

  33. Unpaired Vision-Language Pre-training via Cross-Modal CutMix, ICML 2022

  34. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, ICML 2022, [code]

  35. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, ICML 2022, [code]

  36. GIT: A Generative Image-to-text Transformer for Vision and Language, arXiv 2022, [code]

  37. CoCa: Contrastive Captioners are Image-Text Foundation Models, arXiv 2022, [code]

  38. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, arXiv 2022, [code]

  39. PaLI: A Jointly-Scaled Multilingual Language-Image Model, arXiv 2022

  40. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, arXiv 2023

  41. Language Is Not All You Need: Aligning Perception with Language Models, arXiv 2023, [code]

  42. Unifying Vision-Language Representation Space with Single-tower Transformer, AAAI 2023

Task-specific

Image Caption

  1. Image captioning: XGPT: Cross-modal Generative Pre-Training for Image Captioning, arXiv 2020/03

VQA

  1. VQA: Fusion of Detected Objects in Text for Visual Question Answering, EMNLP 2019, [code], (B2T2)

  2. TextVQA: Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA, CVPR 2020, [code], (M4C)

  3. Chart VQA: STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering, EMNLP 2020.

  4. Visual Question Generation: BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations, arXiv 2020/02

  5. TextVQA: TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation, arXiv 2022, [code], (TAG)

Visual Dialog

  1. VisDial: VD-BERT: A Unified Vision and Dialog Transformer with BERT, EMNLP 2020 [code], (VD-BERT)

  2. VisDial: Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline, ECCV 2020 [code], (VisDial-BERT)

  3. VisDial: UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog, CVPR 2022

Text-Image Retrieval

  1. Text-image retrieval: ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data, arXiv 2020/01

  2. Text-image retrieval: Cross-Probe BERT for Efficient and Effective Cross-Modal Search, ICLR 2021 submission.

  3. Text-image retrieval: Learning Relation Alignment for Calibrated Cross-modal Retrieval, ACL 2021.

  4. Text-image retrieval: Dynamic Contrastive Distillation for Image-Text Retrieval, arXiv 2022/07.

  5. Text-image retrieval: Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval, SIGIR 2022.

Visual Language Navigation

  1. VLN: Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training, CVPR 2020, [code], (PREVALENT)

Visual Machine Reading Comprehension

  1. VisualMRC: VisualMRC: Machine Reading Comprehension on Document Images, AAAI 2021, (LayoutT5, LayoutBART)

Other Tasks

  1. Visual Relationship Detection: Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations, IEEE Access 2021

Other Analysis

  1. Multi-task Learning, 12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020, [code]

  2. Multi-task Learning, Unifying Vision-and-Language Tasks via Text Generation, arXiv 2021/02

  3. Social Bias in VL Embedding, Measuring Social Biases in Grounded Vision and Language Embeddings, arXiv 2020/02, [code]

  4. In-depth Analysis, Are we pretraining it right? Digging deeper into visio-linguistic pretraining

  5. In-depth Analysis, Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models, ECCV 2020 Spotlight

  6. In-depth Analysis, A Closer Look at the Robustness of Vision-and-Language Pre-trained Models, arXiv 2020/12

  7. Adversarial Training, Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020 Spotlight

  8. Adaptive Analysis, Adaptive Transformers for Learning Multimodal Representations, ACL SRW 2020

  9. Neural Architecture Search, Deep Multimodal Neural Architecture Search, arXiv 2020/04

  10. Dataset perspective, Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, arXiv 2021/02

Video-based VLP

  1. VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019

  2. Learning Video Representations Using Contrastive Bidirectional Transformers, arXiv 2019/06, (CBT)

  3. M-BERT: Injecting Multimodal Information in the BERT Structure, arXiv 2019/08

  4. BERT for Large-scale Video Segment Classification with Test-time Augmentation, ICCV 2019 YouTube8M workshop, [code]

  5. Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog, AAAI 2020 DSTC8 workshop

  6. Learning Spatiotemporal Features via Video and Text Pair Discrimination, arXiv 2020/01, (CPD), [code]

  7. UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, arXiv 2020/02

  8. ActBERT: Learning Global-Local Video-Text Representations, CVPR 2020

  9. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training, EMNLP 2020

  10. Video-Grounded Dialogues with Pretrained Generation Language Models, ACL 2020

  11. Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training, arXiv 2020/07

  12. Multimodal Pretraining for Dense Video Captioning, arXiv 2020/11

  13. Parameter Efficient Multimodal Transformers for Video Representation Learning, arXiv 2020/12

  14. Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling, CVPR 2021

Other Transformer-based multimodal networks

  1. Multi-Modality Cross Attention Network for Image and Sentence Matching, CVPR 2020

  2. MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning, ACL 2020

  3. History for Visual Dialog: Do we really need it?, ACL 2020

  4. Cross-Modality Relevance for Reasoning on Language and Vision, ACL 2020

Other Resources
