cmu-thesis

This repository contains the code for three experiments in my PhD thesis, Polyphonic Sound Event Detection with Weak Labeling:

  • Sound event detection with presence/absence labeling on the DCASE 2017 challenge (Chapter 3.2)
  • Sound event detection with presence/absence labeling on Google Audio Set (Chapter 3.3)
  • Sound event detection with sequential labeling on a subset of Google Audio Set (Chapter 4)

Prerequisites

Hardware:

  • A GPU
  • Large storage (1 TB recommended)

Software:

  • Python 2.7
  • PyTorch (I used version 0.4.0a0+d3b6c5e)
  • numpy, scipy, joblib

Quick Start

# Clone the repository
git clone https://github.com/MaigoAkisame/cmu-thesis.git

# Download the data: may take up to 1 day!
cd cmu-thesis/data
./download.sh

# Train a model for the DCASE experiment using default settings
cd ../code/dcase
python train.py            # Needs to run on a GPU

# Evaluate the model at Checkpoint 25
python eval.py --ckpt=25   # Needs to run on a GPU for the first time

# Download and evaluate the TALNet model for the Audio Set experiment
cd ../audioset
./eval-TALNet.sh           # Needs to run on a GPU for the first time

Organization of the Repository

code

The code directory contains three sub-directories: dcase, audioset, and sequential. These contain the code for the three experiments. In each subdirectory:

  • Net.py defines the network architecture (you don't need to execute this script directly);
  • train.py trains the network;
  • eval.py evaluates the network's performance.

The train.py and eval.py scripts accept many command-line arguments, which specify the architecture of the network and the hyperparameters used during training. If you encounter "out of memory" errors, a good first step is to reduce the batch size.

Some scripts that may be of special interest:

  • code/*/util_in.py: Implements data balancing so that each minibatch contains roughly equal numbers of recordings of each event type (see the sketch after this list);
  • code/sequential/ctc.py: My implementation of connectionist temporal classification (CTC);
  • code/sequential/ctl.py: My implementation of connectionist temporal localization (CTL).
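
One way to achieve this kind of balancing is to draw recordings from per-class pools in round-robin order. The sketch below illustrates the idea only; it is not the code in util_in.py, and the function name, arguments, and data layout are assumptions made for illustration.

import random
from itertools import cycle

def balanced_batches(recordings_by_class, batch_size, num_batches):
    # recordings_by_class: dict mapping event class -> list of recording indices.
    # One shuffled pool per class; a pool is reshuffled and refilled whenever it
    # runs dry, so rare classes reappear about as often as common ones.
    pools = dict((c, []) for c in recordings_by_class)
    class_cycle = cycle(sorted(recordings_by_class))
    for _ in range(num_batches):
        batch = []
        while len(batch) < batch_size:
            c = next(class_cycle)
            if not pools[c]:
                pools[c] = random.sample(recordings_by_class[c],
                                         len(recordings_by_class[c]))
            batch.append(pools[c].pop())
        yield batch

This pairs naturally with the per-class organization of the training files described in the data section below.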

data

The script data/download.sh will download and extract three archives into the data directory.

These archives contain Matlab data files (with the .mat extension) that store the filterbank features and ground truth labels. They can be loaded with the scipy.io.loadmat function in Python. Each Matlab file contains three matrices:

  • feat: Filterbank features, a float32 array of shape (n, 400, 64) (n recordings, 400 frames, 64 frequency bins);
  • labels:
    • Presence/absence labeling, a boolean array of shape (n, m) (n recordings, m event types), or
    • Strong labeling, a boolean array of shape (n, 100, m) (n recordings, 100 frames, m event types);
  • hashes: A character array of size (n, 11), containing the YouTube hash IDs of the recordings.

Training recordings are organized by class (so data balancing can be done easily), and each Matlab file contains up to 101 recordings. Validation and test/evaluation recordings are stored in Matlab files that contain up to 500 recordings each.
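
As a quick sanity check, one of these files can be loaded and inspected as follows (a minimal sketch; the file path is a placeholder, not an actual file name from the archives):

import scipy.io

data = scipy.io.loadmat('data/path/to/some_file.mat')  # placeholder path
feat = data['feat']      # float32, shape (n, 400, 64): recordings x frames x frequency bins
labels = data['labels']  # boolean, shape (n, m) or (n, 100, m) depending on the labeling
hashes = data['hashes']  # YouTube hash IDs of the recordings

print(feat.shape)
print(labels.shape)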

Because the data is so huge, I do not provide the code for downloading the raw data, extracting features, and organizing the features and labels into Matlab data files. The whole process took me more than a month and endless babysitting!

workspace

The training logs, trained models, predictions on the test/evaluation recordings, and evaluation results will be generated in this directory. The sub-directory names will reflect the network architecture and hyperparameters for training.

The script code/audioset/eval-TALNet.sh will download the TALNet model and store it at workspace/audioset/TALNet/model/TALNet.pt. At the time of my graduation (October 2018), this was the best model that could both classify and localize sound events on Google Audio Set.
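
After running eval-TALNet.sh, the downloaded model can be inspected from Python. Below is a minimal sketch, assuming the file deserializes directly with torch.load; whether it holds a full model object or only a state dict depends on how it was saved, so the output may differ:

import torch

# Path as described above. If the file stores a full model object rather than a
# state dict, run this from code/audioset so that the Net class is importable.
checkpoint = torch.load('workspace/audioset/TALNet/model/TALNet.pt',
                        map_location=lambda storage, loc: storage)  # force CPU

print(type(checkpoint))
if isinstance(checkpoint, dict):
    print(sorted(checkpoint.keys()))  # parameter names if this is a state dict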

Citing

If you use this code in your research, please cite my PhD thesis:

  • Yun Wang, "Polyphonic sound event detection with weak labeling", PhD thesis, Carnegie Mellon University, Oct. 2018.

and/or the following publications:

  • Yun Wang, Juncheng Li and Florian Metze, "A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling," arXiv e-prints, Oct. 2018. [Online]. Available: http://arxiv.org/abs/1810.09050.
  • Yun Wang and Florian Metze, "Connectionist temporal localization for sound event detection with sequential labeling," arXiv e-prints, Oct. 2018. [Online]. Available: http://arxiv.org/abs/1810.09052.
