

Awesome Audio-Visual

A curated list of papers and datasets for various audio-visual tasks, inspired by awesome-computer-vision.

Contents

Audio-Visual Localization

Audio-Visual Separation

Audio-Visual Representation/Classification/Retrieval

Audio-Visual Action Recognition

Audio-Visual Spatial/Depth

Audio-Visual Highlight Detection

Audio-Visual Deepfake

Audio-Visual Navigation/RL

Audio-Visual Faces/Speech

Audio-Visual Learning of Scene Acoustics

Audio-Visual Question Answering

Cross-modal Generation (Audio-Video / Video-Audio)

Audio-Visual Stylization/Generation

Multi-modal Architectures

Uncategorized Papers

Datasets

General Audio-Visual Tasks

  • AudioSet - Audio-Visual Classification
  • MUSIC - Audio-Visual Source Separation
  • AudioSetZSL - Audio-Visual Zero-shot Learning
  • Visually Engaged and Grounded AudioSet (VEGAS) - Sound generation from video
  • SoundNet-Flickr - Image-Audio pair for cross-modal learning
  • Audio-Visual Event (AVE) - Audio-Visual Event Localization
  • AudioSet Single Source - Subset of AudioSet videos containing only a single sounding object
  • Kinetics-Sounds - Subset of the Kinetics dataset
  • EPIC-Kitchens - Egocentric Audio-Visual Action Recognition
  • Audio-Visually Indicated Actions Dataset - Multimodal dataset (RGB, acoustic data as raw audio) acquired using an acoustic-optical camera
  • IMSDb dataset - Movie scripts downloaded from the Internet Movie Script Database
  • YOUTUBE-ASMR-300K dataset - ASMR videos collected from YouTube that contain stereo audio
  • FAIR-Play - 1,871 video clips and their corresponding binaural audio clips recorded in a music room
  • VGG-Sound - audio-visual correspondent dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube
  • XD-Violence - weakly annotated dataset for audio-visual violence detection
  • AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE) - Geotagged aerial images and sounds, classified into 13 scene classes
  • auDIoviSual Crowd cOunting dataset (DISCO) - 1,935 images with corresponding audio from various typical scenes, with a total of 170,270 instances annotated with head locations.
  • MUSIC-Synthetic dataset - Category-balanced multi-source videos created by artificially synthesizing solo videos from the MUSIC dataset, to facilitate the learning and evaluation of multiple-sounding-source localization in the cocktail-party scenario.
  • ACAV100M - Built by processing 140 million full-length videos (total duration 1,030 years) to produce a dataset of 100 million 10-second clips (31 years) with high audio-visual correspondence.
  • AIST++ - A large-scale 3D human dance motion dataset, which contains a wide variety of 3D motion paired with music. It is built upon the AIST Dance Database, which is an uncalibrated multi-view collection of dance videos.
  • VideoCC - A dataset containing (video-URL, caption) pairs for training video-text machine learning models. It is created using an automatic pipeline starting from the Conceptual Captions Image-Captioning Dataset.
  • ssw60 - A dataset for research on audiovisual fine-grained categorization. The dataset covers 60 species of birds that all occur in a specific geographic location: Sapsucker Woods, Ithaca, NY. It comprises images from existing datasets and brand-new, expert-curated audio and video data.
  • PACS - A dataset designed to help create and evaluate a new generation of AI algorithms able to reason about physical commonsense using both audio and visual modalities.
  • AVSBench - A dataset for the audio-visual pixel-wise segmentation task.
  • UnAV-100 - The dataset consists of more than 10K untrimmed videos with over 30K audio-visual events covering 100 different event categories. There are often multiple audio-visual events that might be very short or long, and occur concurrently in each video as in real-life audio-visual scenes.

Face-Voice Dataset

Licenses

License

CC0

To the extent possible under law, Kranti Kumar Parida has waived all copyright and related or neighboring rights to this work.

Contributing

Please feel free to send me pull requests or email ([email protected]) to add links, correct wrong ones, or report broken links.

Contributors

dtaoo, himangim, krantiparida, sagnikmjr, xjchengit
