Awesome Language Grounding

A curated list of resources for language grounding research.

Contributing

Please feel free to email Borui Wang ([email protected]).

Table of Contents

  • Surveys
  • Courses
  • Papers
      • Language Grounding to Vision
      • Language Grounding to Robotics
  • Datasets

Surveys

Experience Grounds Language [Paper]

Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, Joseph Turian (EMNLP 2020)

This survey posits that the present success of representation learning approaches trained on large, text-only corpora requires the parallel tradition of research on the broader physical and social context of language to address the deeper questions of communication.

Courses

Carnegie Mellon University 10-808: Language Grounding to Vision and Control (Fall 2017) [Course Link]

Instructor: Prof. Katerina Fragkiadaki

This is a seminar course that surveys recent progress on the problem of language acquisition through the pairing of multiple modalities (vision, haptics, audio, etc.), as well as active interaction with the world. The central questions and topics covered are:

  • How can language help accelerate the learning of an autonomous agent?
  • How do humans acquire language, and why?
  • Inductive biases for strong generalization.
  • Architectures for agents capable of compositional grounding of language.
  • State representations of visual scenes from video, and imagination from story reading.
  • Language for high-level planning and control.
  • Neural-symbolic architectures for hierarchical symbolic grounding.

University of Texas at Austin CS 395T: Grounded Natural Language Processing (Spring 2021) [Course Link]

Instructor: Prof. Raymond J. Mooney

This course is a graduate research seminar in grounded natural language processing (GNLP), a subarea of AI that studies the connection between natural language and perception and action in the world. It makes connections between natural language processing (NLP) and computer vision, robotics, and computer graphics. Almost all work in the area uses machine learning to learn the connection between language and perception and/or action from some form of multi-modal training data.

Papers

Language Grounding to Vision

Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search [Paper]

Jamie Ryan Kiros, William Chan, Geoffrey E. Hinton (ACL 2018)

This paper introduces Picturebook, a large-scale lookup operation to ground language via ‘snapshots’ of our physical world accessed through image search.
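
The lookup idea can be sketched in a few lines. Everything below (the word_emb and picturebook tables, the dimensions, the plain concatenation) is an illustrative placeholder: in the paper the grounded vectors come from embedding the top image-search results for each word, and concatenation here only stands in for however the model actually fuses the two views.

```python
# A minimal sketch of a Picturebook-style lookup (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["dog", "runs", "beach"]
dim_text, dim_img = 300, 512  # placeholder dimensions

# Hypothetical lookup tables: textual word embeddings and image-grounded
# "snapshot" vectors built offline from image-search results.
word_emb = {w: rng.normal(size=dim_text) for w in vocab}
picturebook = {w: rng.normal(size=dim_img) for w in vocab}

def ground_token(word):
    """Pair the textual embedding with its image-grounded snapshot vector."""
    return np.concatenate([word_emb[word], picturebook[word]])

sentence = ["dog", "runs", "beach"]
grounded = np.stack([ground_token(w) for w in sentence])
print(grounded.shape)  # (3, 812)
```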

Learning Visually Grounded Sentence Representations [Paper]

Douwe Kiela, Alexis Conneau, Allan Jabri, Maximilian Nickel (NAACL 2018)

This paper investigates grounded sentence representations, where they train a sentence encoder to predict the image features of a given caption. They examine the quality of the learned representations on a variety of standard sentence representation quality benchmarks, showing improved performance for grounded models over non-grounded ones.
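
A minimal sketch of the grounding objective described above, assuming a toy GRU encoder and a simple regression loss; the paper's actual architecture and training loss may differ.

```python
# Sketch: train a sentence encoder to predict the image features of its caption.
import torch
import torch.nn as nn

vocab_size, emb_dim, img_dim = 1000, 128, 2048  # placeholder sizes

class GroundedEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.project = nn.Linear(emb_dim, img_dim)  # map to image-feature space

    def forward(self, token_ids):
        _, h = self.rnn(self.embed(token_ids))
        return self.project(h[-1])  # predicted image features of the caption

model = GroundedEncoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy (caption, CNN-feature) pair standing in for real paired data.
captions = torch.randint(0, vocab_size, (8, 12))
image_feats = torch.randn(8, img_dim)

loss = nn.functional.mse_loss(model(captions), image_feats)
loss.backward()
optim.step()
```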

Grounding language acquisition by training semantic parsers using captioned videos [Paper]

Candace Ross, Andrei Barbu, Yevgeni Berzak, Battushig Myanganbayar, Boris Katz (EMNLP 2018)

This paper develops a semantic parser that is trained in a grounded setting using pairs of videos captioned with sentences. This setting is both data-efficient, requiring little annotation, and similar to the experience of children, who observe their environment and listen to speakers. The semantic parser recovers the meaning of English sentences despite not having access to any annotated sentences, and despite the ambiguity inherent in vision, where a sentence may refer to any combination of objects, object properties, relations, or actions taken by any agent in a video. For this task, the authors collected a new dataset for grounded language acquisition. Learning a grounded semantic parser, which turns sentences into logical forms using captioned videos, can significantly expand the range of data that parsers can be trained on, lower the effort of training a semantic parser, and ultimately lead to a better understanding of child language acquisition.
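
The weak-supervision signal can be illustrated with a toy sketch. The names below (candidate_parses, video_entails) are hypothetical placeholders, not the authors' implementation; the point is only that candidate logical forms are filtered by consistency with the paired video rather than by gold annotations.

```python
# Toy sketch: supervision comes from whether a candidate meaning fits the video.

def candidate_parses(sentence):
    # A real parser would enumerate logical forms compositionally from the words.
    return [
        ("pickup", "person", "ball"),
        ("putdown", "person", "ball"),
    ]

def video_entails(video, logical_form):
    # Stand-in for vision-based verification (object and action detectors).
    return logical_form in video["observed_events"]

video = {"observed_events": {("pickup", "person", "ball")}}
sentence = "The person picked up the ball."

# Keep only parses consistent with the video; these act as noisy supervision.
consistent = [lf for lf in candidate_parses(sentence) if video_entails(video, lf)]
print(consistent)  # [('pickup', 'person', 'ball')]
```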

Visually Grounded Neural Syntax Acquisition [Paper]

Haoyue Shi, Jiayuan Mao, Kevin Gimpel, Karen Livescu (ACL 2019)

This paper presents the Visually Grounded Neural Syntax Learner (VG-NSL), an approach for learning syntactic representations and structures without any explicit supervision. The model learns by looking at natural images and reading paired captions. VG-NSL generates constituency parse trees of texts, recursively composes representations for constituents, and matches them with images.
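
A rough sketch of the image-matching signal that drives this kind of learning; the composition rule (a normalized sum) and all vectors below are illustrative assumptions, not the authors' code.

```python
# Sketch: compose candidate constituents and score them against an image vector.
import numpy as np

def l2norm(v):
    return v / (np.linalg.norm(v) + 1e-8)

def compose(left, right):
    # Children are composed into a parent constituent; a normalized sum is
    # used here purely for illustration.
    return l2norm(left + right)

def image_score(constituent, image_vec):
    return float(np.dot(l2norm(constituent), l2norm(image_vec)))

rng = np.random.default_rng(0)
dim = 64
word_vecs = {w: rng.normal(size=dim) for w in ["a", "dog", "on", "grass"]}
image_vec = rng.normal(size=dim)

np_phrase = compose(word_vecs["a"], word_vecs["dog"])    # candidate constituent
pp_phrase = compose(word_vecs["on"], word_vecs["grass"])  # another candidate
print(image_score(np_phrase, image_vec), image_score(pp_phrase, image_vec))
```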

Incorporating Visual Semantics into Sentence Representations within a Grounded Space [Paper]

Patrick Bordes, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, Patrick Gallinari (EMNLP 2019)

This paper proposes to transfer visual information to textual representations by learning an intermediate representation space: the grounded space. They further propose two new complementary objectives ensuring that (1) sentences associated with the same visual content are close in the grounded space and (2) similarities between related elements are preserved across modalities. They show that this model outperforms the previous state-of-the-art on classification and semantic relatedness tasks.
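
The two objectives can be sketched as simple losses, under the assumption of cosine similarity and toy tensors; this is not the paper's exact formulation.

```python
# Sketch of the two complementary grounded-space objectives described above.
import torch
import torch.nn.functional as F

def cluster_loss(s_i, s_j):
    """(1) Sentences associated with the same visual content should be close."""
    return (1.0 - F.cosine_similarity(s_i, s_j)).mean()

def similarity_preservation_loss(sent_grounded, img_feats):
    """(2) Pairwise similarities should be preserved across modalities."""
    sim_text = F.cosine_similarity(sent_grounded.unsqueeze(1), sent_grounded.unsqueeze(0), dim=-1)
    sim_img = F.cosine_similarity(img_feats.unsqueeze(1), img_feats.unsqueeze(0), dim=-1)
    return F.mse_loss(sim_text, sim_img)

sent_grounded = torch.randn(4, 256)    # sentences projected into the grounded space
same_image_sent = torch.randn(4, 256)  # other sentences describing the same images
img_feats = torch.randn(4, 2048)       # CNN features of the corresponding images

loss = cluster_loss(sent_grounded, same_image_sent) + similarity_preservation_loss(sent_grounded, img_feats)
print(loss.item())
```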

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering [Paper]

Drew A. Hudson, Christopher D. Manning (CVPR 2019)

This paper introduces GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets. The authors developed a strong and robust question engine that leverages scene graph structures to create 22M diverse reasoning questions, all of which come with functional programs that represent their semantics. They use the programs to gain tight control over the answer distribution and present a new tunable smoothing technique to mitigate question biases.
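
To make the idea of a functional program concrete, here is an illustrative toy example; the dataset's actual program syntax and operator set differ in detail.

```python
# Illustrative sketch of a GQA-style question paired with executable semantics
# over a scene graph (not the dataset's real schema).
question = "What color is the mug to the left of the laptop?"
program = [
    {"op": "select", "args": ["laptop"]},
    {"op": "relate", "args": ["to the left of"]},
    {"op": "query",  "args": ["color"]},
]

# A toy scene graph and executor, purely for illustration.
scene = {
    "laptop": {"relations": {"to the left of": "mug"}},
    "mug": {"color": "blue"},
}

def execute(program, scene):
    current = None
    for step in program:
        if step["op"] == "select":
            current = step["args"][0]
        elif step["op"] == "relate":
            current = scene[current]["relations"][step["args"][0]]
        elif step["op"] == "query":
            return scene[current][step["args"][0]]

print(execute(program, scene))  # blue
```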

VideoBERT: A Joint Model for Video and Language Representation Learning [Paper]

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid (ICCV 2019)

This paper proposes a joint visual-linguistic model to learn high-level features without any explicit supervision. The authors build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. They apply VideoBERT to numerous tasks, including action classification and video captioning. They show that it can be applied directly to open-vocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, they outperform the state-of-the-art on video captioning, and quantitative results verify that the model learns high-level semantic features.
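
The tokenization step can be sketched as follows. The codebook below is a random placeholder standing in for the centroids the paper learns by clustering pretrained video features; everything else is likewise illustrative.

```python
# Sketch: vector-quantize video clip features into discrete "visual words" and
# interleave them with text tokens into one sequence for BERT-style training.
import numpy as np

rng = np.random.default_rng(0)
num_centroids, feat_dim = 16, 1024
centroids = rng.normal(size=(num_centroids, feat_dim))  # stand-in codebook
frame_feats = rng.normal(size=(5, feat_dim))             # features of 5 video clips

def quantize(features, codebook):
    """Assign each feature vector to its nearest centroid id (a visual token)."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

visual_tokens = [f"vid_{i}" for i in quantize(frame_feats, centroids)]
text_tokens = "add the onions to the pan".split()

# Joint sequence fed to a BERT-style masked model over both modalities.
sequence = ["[CLS]"] + text_tokens + ["[SEP]"] + visual_tokens + ["[SEP]"]
print(sequence)
```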

Language Grounding to Robotics

ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks [Paper] [Project Page]

Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, Dieter Fox (CVPR 2020)

This paper presents ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives.
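
For a sense of what the supervision looks like, here is an illustrative, simplified episode; field names are placeholders, the action names are loosely based on AI2-THOR-style interactions, and none of it matches the dataset's exact JSON schema.

```python
# Illustrative sketch of an ALFRED-style (directive, instructions, actions) episode.
episode = {
    "goal": "Put a clean sponge on the counter.",
    "instructions": [
        "Walk to the sink and pick up the sponge.",
        "Rinse the sponge under the faucet.",
        "Place the sponge on the counter.",
    ],
    "actions": [
        {"action": "MoveAhead"},
        {"action": "PickupObject", "object": "Sponge"},
        {"action": "ToggleObjectOn", "object": "Faucet"},
        {"action": "ToggleObjectOff", "object": "Faucet"},
        {"action": "PutObject", "object": "Sponge", "receptacle": "CounterTop"},
    ],
}

# A model for this benchmark maps (goal, instructions, egocentric frames)
# to the next low-level action at every time step.
for step in episode["actions"]:
    print(step)
```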

Datasets

  • GQA from Stanford University [Dataset]
  • Visual Commonsense Reasoning (VCR) from University of Washington and AI2 [Dataset]
  • MS COCO [Dataset]
  • Visual Genome from Stanford [Dataset]
  • ALFRED from University of Washington [Dataset]
