Introduction

RGB representation of sound for environmental sound classification:

Dataset description:

ESC-50 clip preview

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

The dataset consists of 5-second-long recordings organized into 50 semantic classes (with 40 examples per class) loosely arranged into 5 major categories:

| Animals | Natural soundscapes & water sounds | Human, non-speech sounds | Interior/domestic sounds | Exterior/urban noises |
| --- | --- | --- | --- | --- |
| Dog | Rain | Crying baby | Door knock | Helicopter |
| Rooster | Sea waves | Sneezing | Mouse click | Chainsaw |
| Pig | Crackling fire | Clapping | Keyboard typing | Siren |
| Cow | Crickets | Breathing | Door, wood creaks | Car horn |
| Frog | Chirping birds | Coughing | Can opening | Engine |
| Cat | Water drops | Footsteps | Washing machine | Train |
| Hen | Wind | Laughing | Vacuum cleaner | Church bells |
| Insects (flying) | Pouring water | Brushing teeth | Clock alarm | Airplane |
| Sheep | Toilet flush | Snoring | Clock tick | Fireworks |
| Crow | Thunderstorm | Drinking, sipping | Glass breaking | Hand saw |

A detailed description of the dataset is available on the ESC-50 dataset page, where it can also be downloaded (~600 MB).

K. J. Piczak. ESC: Dataset for Environmental Sound Classification. Proceedings of the 23rd Annual ACM Conference on Multimedia, Brisbane, Australia, 2015.

[DOI: http://dx.doi.org/10.1145/2733373.2806390]

Converting sound to RGB picture

Audio can be represented in many different ways, from raw time series to time-frequency decompositions, and the choice of representation is crucial to the performance of the system. Initially, the ESC task was mainly tackled with handcrafted features or Mel-frequency cepstral coefficients (MFCCs). More recently, time-frequency representations have proved particularly useful for environmental sound classification: they better capture the non-stationary, dynamic nature of sound, and they suit the convolutional neural networks (CNNs) that recently revolutionized computer vision.

Among time-frequency decompositions, spectrograms have proved a useful representation for the ESC task, and more generally for audio processing. A spectrogram is a 2D image representing a sequence of Short-Time Fourier Transforms (STFT), with time and frequency as axes and brightness encoding the strength of each frequency component at each time frame. As such, spectrograms are a natural domain for applying image CNN architectures directly to sound.
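The STFT-based spectrogram described above can be sketched with SciPy; the window length, sampling rate, and test tone below are illustrative values, not the project's actual settings.

```python
import numpy as np
from scipy.signal import stft

def log_spectrogram(signal, fs, nperseg=512):
    """Return the log-magnitude spectrogram (dB) of a 1-D signal.

    Rows are frequency bins, columns are time frames.
    """
    _, _, zxx = stft(signal, fs=fs, nperseg=nperseg)
    return 20 * np.log10(np.abs(zxx) + 1e-10)  # small offset avoids log(0)

fs = 22050
t = np.linspace(0, 5, 5 * fs, endpoint=False)  # a 5-second clip, as in ESC-50
clip = np.sin(2 * np.pi * 440 * t)             # 440 Hz test tone
spec = log_spectrogram(clip, fs)
```

With `nperseg=512` the output has `512 // 2 + 1 = 257` frequency rows; the brightest row sits near the 440 Hz bin for this tone.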

Another, less commonly used representation is the scalogram, the two-dimensional output of the wavelet transform, with scale and time as axes. Because wavelets filter the signal in a multiscale way, scalograms may outperform spectrograms on the ESC task.
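A scalogram can be sketched as a continuous wavelet transform computed by convolving the signal with a wavelet at several scales. The minimal numpy version below uses a real Morlet-style wavelet; the scales, the `w` parameter, and the support length are illustrative assumptions, not the project's actual choices.

```python
import numpy as np

def morlet(points, scale, w=5.0):
    """Real Morlet-style wavelet: a cosine under a Gaussian envelope."""
    x = (np.arange(points) - (points - 1) / 2.0) / scale
    return np.cos(w * x) * np.exp(-0.5 * x ** 2) / np.sqrt(scale)

def scalogram(signal, scales, w=5.0):
    """|CWT| magnitudes: one row per scale, columns aligned with time."""
    out = np.empty((len(scales), len(signal)))
    for i, s in enumerate(scales):
        support = min(len(signal), int(10 * s))  # wavelet support grows with scale
        out[i] = np.abs(np.convolve(signal, morlet(support, s, w), mode="same"))
    return out

sig = np.sin(2 * np.pi * np.linspace(0, 10, 2000))
scal = scalogram(sig, scales=np.arange(1, 31))
```

Small scales respond to fast oscillations and large scales to slow ones, which is the multiscale filtering behaviour mentioned above.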

In this project, sound is represented as an RGB colour image: the red channel holds the spectrogram, the green channel the scalogram, and the blue channel the MFCCs.
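The channel fusion can be sketched as follows, assuming the three feature maps have already been resampled to a common height and width (the 64x64 toy arrays below stand in for real features):

```python
import numpy as np

def to_unit_range(a):
    """Min-max normalise a 2-D map to [0, 1]."""
    a = a.astype(np.float64)
    span = a.max() - a.min()
    return (a - a.min()) / span if span > 0 else np.zeros_like(a)

def rgb_representation(spectrogram, scalogram, mfcc):
    """Stack red=spectrogram, green=scalogram, blue=MFCC into an HxWx3 image."""
    return np.dstack([to_unit_range(m) for m in (spectrogram, scalogram, mfcc)])

# toy maps standing in for the real spectrogram / scalogram / MFCC
r, g, b = (np.random.rand(64, 64) for _ in range(3))
img = rgb_representation(r, g, b)
```

Normalising each channel independently keeps all three representations on the same [0, 1] scale expected by image networks.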

Sound representation

Classifying environmental sound using transfer learning

Environmental sound data generally suffer from a scarcity of labeled examples, and ESC-50 is no exception, which makes it difficult to train deep models from scratch. In computer vision, by contrast, the availability of large datasets such as ImageNet has enabled the training of very deep models for image classification. Since 2012, state-of-the-art computer vision networks have been trained on ImageNet, a dataset containing approximately 1.2 million images divided into 1000 categories. The best-performing architectures rely on convolutional neural networks, and some of their pre-trained weights are available in the Keras library.

Among them, we considered VGG19, InceptionResnetv2 and Xception.

VGG19 is composed of convolutional layers with 3x3 filters and is characterized by its simplicity. InceptionResnetv2 combines the Inception module, acting as a multi-level feature extractor, with residual connections to accelerate training. Finally, Xception extends the Inception architecture with depthwise separable convolutions. The pre-trained weights of these networks can be used as feature extractors on our RGB representation of sound in a transfer learning approach.
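The transfer-learning setup can be sketched in Keras as a frozen pre-trained backbone with a small classification head for the 50 ESC classes. The input size and head layout below are illustrative assumptions; in practice `weights="imagenet"` loads the pre-trained features (`weights=None` is used here only to keep the sketch offline-friendly).

```python
from tensorflow.keras.applications import VGG19
from tensorflow.keras import layers, models

# Backbone: VGG19 without its ImageNet classification head.
# In practice use weights="imagenet" to get the pre-trained features.
base = VGG19(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the convolutional feature extractor

# Small trainable head mapping pooled features to the 50 ESC-50 classes.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dense(50, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Only the two dense layers are trained on the RGB sound images; the frozen backbone supplies generic visual features, which is what makes the approach viable on a 2000-clip dataset.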

Transfer learning

Contributors

vbelz
