drstef / deep-learning-and-digital-signal-processing-for-environmental-sound-classification


Automatic environmental sound classification (ESC) based on ESC-50 dataset (and ESC-10 subset)

License: Other

Jupyter Notebook 100.00%
convolutional-neural-networks digital-signal-processing keras-tensorflow librosa mel-spectrograms python wavelet-transform audio audio-processing cwt deep-learning complex-wavelets

Deep Learning and Digital Signal Processing for Environmental Sound Classification


Introduction


This project performs automatic environmental sound classification (ESC) on the ESC-50 dataset (and its ESC-10 subset), built by Karol Piczak and described in the following article:

Karol J. Piczak. 2015. "ESC: Dataset for Environmental Sound Classification." In Proceedings of the 23rd ACM International Conference on Multimedia (MM '15). Association for Computing Machinery, New York, NY, USA, 1015–1018. https://doi.org/10.1145/2733373.2806390

The ESC-50 dataset is available from Dr. Piczak's GitHub: https://github.com/karoldvl/ESC-50/

The following recent article is a descriptive survey of environmental sound classification (ESC), detailing datasets, preprocessing techniques, features, classifiers, and their accuracy:

Anam Bansal, Naresh Kumar Garg, "Environmental Sound Classification: A descriptive review of the literature," Intelligent Systems with Applications, Volume 16, 2022, 200115, ISSN 2667-3053, https://doi.org/10.1016/j.iswa.2022.200115.

Dr. Piczak maintains a table of the best results on his GitHub, listing authors, publication, and method used. We reproduce the top of that table here, for supervised classification.

| Title | Notes | Accuracy | Paper | Code |
|---|---|---|---|---|
| BEATs: Audio Pre-Training with Acoustic Tokenizers | Transformer model pretrained with acoustic tokenizers | 98.10% | chen2022 | 📜 |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 | 📜 |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 | 📜 |
| AST: Audio Spectrogram Transformer | Pure Attention Model Pretrained on AudioSet | 95.70% | gong2021 | 📜 |
| Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | A Transformer model pretrained w/ visual image supervision | 95.70% | zhao2022 | 📜 |

We develop our own pre-processing techniques to achieve the best accuracy, guided by Dr. Piczak's table and Bansal et al.
At this point, before we start working on more advanced techniques:

  • we work with the ESC-10 sub-dataset.
  • we test mel-spectrograms and wavelet transforms.

We will train a Convolutional Neural Network (CNN) on grayscale spectrograms and scalograms. We target an accuracy well above 90%.
Once tests with the most effective CNN implementation are complete, we will run predictions on various audio clips downloaded from YouTube and, if needed, update the CNN hyperparameters.

ESC-10 Type of sounds/noises


The ESC-10 dataset contains 400 five-second Ogg Vorbis audio clips (sampling frequency 44.1 kHz, 32-bit float) in 10 classes,
with 40 audio clips per class.
The 10 Sound/Noise classes are:

  • Class = 01-Dogbark, Label = 0
  • Class = 02-Rain, Label = 1
  • Class = 03-Seawaves, Label = 2
  • Class = 04-Babycry, Label = 3
  • Class = 05-Clocktick, Label = 4
  • Class = 06-Personsneeze, Label = 5
  • Class = 07-Helicopter, Label = 6
  • Class = 08-Chainsaw, Label = 7
  • Class = 09-Rooster, Label = 8
  • Class = 10-Firecrackling, Label = 9
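
The class-to-label mapping above can be written down as a small lookup table. This is only a sketch: the names `ESC10_CLASSES`, `label_from_class_name` and `LABELS` are hypothetical and are not taken from the notebooks. It assumes the label is always the two-digit class prefix minus one, which matches the list above.

```python
# Class names exactly as listed above (two-digit prefix, then the class name).
ESC10_CLASSES = [
    "01-Dogbark",
    "02-Rain",
    "03-Seawaves",
    "04-Babycry",
    "05-Clocktick",
    "06-Personsneeze",
    "07-Helicopter",
    "08-Chainsaw",
    "09-Rooster",
    "10-Firecrackling",
]


def label_from_class_name(class_name: str) -> int:
    """Return the integer label: the numeric prefix minus one."""
    prefix = class_name.split("-", 1)[0]
    return int(prefix) - 1


# Hypothetical lookup table: class name -> label (0..9).
LABELS = {name: label_from_class_name(name) for name in ESC10_CLASSES}
```

For example, `LABELS["07-Helicopter"]` gives the label 6 listed above.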

Quick analysis of the type of sound/noise:

  • Dog barking, baby crying, person sneezing and rooster involve non-linear vibration and resonance of the vocal (or nasal) tract and cords, a bit like speech, and are considered non-stationary.
  • Rain and sea waves are somewhat stationary; rain sounds a bit like white noise. They are pseudo-stationary because other noises occur at times in various audio clips.
  • Helicopter, chainsaw: pseudo-stationary. If the engine rpm does not change over a timeframe, the process is stationary, with harmonics linked to the engine rpm, the number of cylinders, and the number of rotor blades (helicopter).
  • Fire crackling: impulsive noise, but with pseudo-stationary background noise.
  • Clock tick: it depends. Impulsive every second (frequency = 1 Hz), but in some audio clips there are several "pulsations" within a 1-second timeframe. The ticks also have the signature of a non-linear mechanical vibration that radiates sound, with harmonics.
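
As a rough illustration of how harmonics relate to engine rpm, cylinder count and rotor blade count, the sketch below uses the standard textbook relations for blade-pass and piston-engine firing frequencies. The function names and the numeric values in the usage note are hypothetical and are not measured from the ESC-10 clips.

```python
def blade_pass_frequency(rotor_rpm: float, n_blades: int) -> float:
    """Fundamental (Hz) produced by rotor blades passing a fixed point."""
    return rotor_rpm / 60.0 * n_blades


def firing_frequency(engine_rpm: float, n_cylinders: int, strokes: int = 4) -> float:
    """Firing frequency (Hz) of a piston engine.

    A cylinder fires once per revolution in a 2-stroke engine and once
    every two revolutions in a 4-stroke engine.
    """
    revolutions_per_firing = strokes / 2
    return engine_rpm / 60.0 * n_cylinders / revolutions_per_firing


def harmonics(fundamental_hz: float, count: int) -> list:
    """First `count` harmonics of a fundamental, starting at the fundamental."""
    return [fundamental_hz * k for k in range(1, count + 1)]
```

With made-up values, a 4-blade rotor at 300 rpm gives a 20 Hz blade-pass fundamental, and a single-cylinder 2-stroke engine at 9000 rpm fires at 150 Hz. As long as the rpm is steady these lines stay fixed in the spectrum, which is why such clips look pseudo-stationary.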

Quick Literature review

Methodology

  • In an effort to reduce the size of the problem and computation time, while retaining relevant information, we:
    • reduce the audio sampling frequency from 44.1 kHz to 22.05 kHz.
    • reduce the length of audio clips to 1.25 s, based on signal power considerations. Many audio clips contain several occurrences of the same sound phenomenon (dog barking or baby crying, for example), and most of the signal is "silence".
  • Normalize the audio signal peak amplitude to 1 (0 dBFS).
  • Compute mel-spectrograms or wavelet transforms for the 10 classes. We empirically optimized the wavelet selection and the wavelet transform parameters.
  • Reduce the size of scalograms in the time domain (some details are lost).
  • Train a CNN on 256x256 grayscale mel-spectrograms or 2 series of 128x128 grayscale scalograms: magnitude and phase. Train/test split: 80/20%.
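
The clip-shortening and normalization steps can be sketched with NumPy alone. The exact power criterion used in the notebooks is not stated here, so this example assumes the simplest one: keep the 1.25 s window with the highest total energy. The helper names `loudest_window` and `peak_normalize` are hypothetical, and the signal is assumed to be already resampled to 22.05 kHz (for instance with `librosa.load(path, sr=22050)`).

```python
import numpy as np


def loudest_window(y, sr, duration=1.25):
    """Return the `duration`-second slice of `y` with the highest energy.

    Assumption: "signal power considerations" is read as maximum total
    energy over a sliding window; the notebooks may use another rule.
    """
    y = np.asarray(y, dtype=np.float64)
    n = int(duration * sr)
    if y.size <= n:
        # Clip shorter than the window: zero-pad at the end.
        return np.pad(y, (0, n - y.size))
    # Cumulative energy makes every window sum an O(1) difference.
    cumulative = np.concatenate(([0.0], np.cumsum(y ** 2)))
    window_energy = cumulative[n:] - cumulative[:-n]
    start = int(np.argmax(window_energy))
    return y[start:start + n]


def peak_normalize(y):
    """Scale the signal so that its peak absolute amplitude is 1 (0 dBFS)."""
    y = np.asarray(y, dtype=np.float64)
    peak = np.max(np.abs(y))
    return y if peak == 0 else y / peak
```

At 22.05 kHz a 1.25 s window is 27562 samples, so `peak_normalize(loudest_window(y, 22050))` returns a fixed-length clip whose largest sample has magnitude 1.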

We tested three methods:

  • Mel-spectrograms.
  • Complex Continuous Wavelet Transforms (complex CWT).
  • Fusion mel-spectrograms + complex CWT.

After an 80%/20% train/test split, we train a Convolutional Neural Network with hidden layers of 32, 64, 128 and 256 neurons. Parameters are detailed in the CNN section of the notebooks.
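
A minimal sketch of an 80/20 split follows. It assumes a stratified split, so that each class keeps the same 80/20 ratio; whether the notebooks stratify is not stated. The function name `stratified_split` and the seed are hypothetical.

```python
import numpy as np


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with `test_fraction` of each class held out."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut off the test share.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(test_fraction * idx.size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.sort(np.array(train_idx)), np.sort(np.array(test_idx))
```

With the 400 ESC-10 clips (40 per class), this gives 320 training clips and 80 test clips, 8 per class.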
Note: Although mel-spectrograms and wavelet transforms are shown in color, the CNN is trained with grayscale images.
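
One way to turn a spectrogram or scalogram matrix into a grayscale image is a plain min-max rescaling to 8 bits. This is a sketch under that assumption; the notebooks may scale differently, and `to_grayscale_uint8` is a hypothetical name.

```python
import numpy as np


def to_grayscale_uint8(image):
    """Min-max scale a 2-D array (e.g. dB values) to uint8 in [0, 255]."""
    image = np.asarray(image, dtype=np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # Constant input: no contrast to preserve.
        return np.zeros(image.shape, dtype=np.uint8)
    scaled = (image - lo) / (hi - lo)
    return np.round(scaled * 255.0).astype(np.uint8)
```

The color maps in the figures are then only a display choice: the CNN sees a single intensity channel per image.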

ESC-10 Results Synthesis

The best accuracy obtained with each of the 3 methods is summarized in the table below.

| Method | Accuracy |
|---|---|
| 256x256 Mel-spectrograms | 92.5 % |
| 128x128 Complex CWT Scalograms Magnitude + Phase | 94 % |
| 128x128 Fusion Complex CWT + Mel-Spectrograms | 99 % |

Details of the best result with the "Fusion" method:

Classification report

Confusion matrix
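
For readers who want to recompute these figures from predictions, here is a sketch of a confusion matrix, overall accuracy and per-class recall in NumPy. The function names are hypothetical and the labels in the usage note are toy values, not results from this project.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true labels, columns are predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly when the same cell is hit repeatedly.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


def accuracy(cm):
    """Fraction of samples on the diagonal."""
    return np.trace(cm) / cm.sum()


def per_class_recall(cm):
    """Correct predictions per class divided by the class support."""
    return np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
```

With toy labels `y_true = [0, 0, 1, 1, 2, 2]` and `y_pred = [0, 1, 1, 1, 2, 0]`, four of the six samples lie on the diagonal, so the accuracy is 4/6.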



Jupyter Notebooks

All Jupyter Notebooks share the same structure. They are identical except when we implement wavelet transforms or mel-spectrogram transforms.

Reduction of audio clip length and optimization of mel-spectrogram parameters for best discrimination of sound categories. We train the CNN with 256x256 grayscale images. Accuracy: ~92.5%

Mel-spectrograms (dB)
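
To make the mel-spectrogram computation concrete, the sketch below builds one from scratch with NumPy: Hann-windowed STFT power, a triangular mel filterbank (HTK-style mel formula), then conversion to dB relative to the maximum. It is illustrative only. In practice a library routine such as `librosa.feature.melspectrogram` would be used, and the values `n_fft=1024`, `hop=256` and `n_mels=64` are assumptions, not the optimized parameters from the notebook.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters equally spaced on the mel scale, plus their centers (Hz)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb, edges[1:-1]


def mel_spectrogram_db(y, sr, n_fft=1024, hop=256, n_mels=64):
    """Return (S_db, centers): S_db has shape (n_mels, n_frames), max value 0 dB."""
    y = np.asarray(y, dtype=np.float64)
    n_frames = 1 + (y.size - n_fft) // hop
    if n_frames < 1:
        raise ValueError("signal shorter than one FFT frame")
    window = np.hanning(n_fft)
    frames = np.stack([y[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_fft // 2 + 1)
    fb, centers = mel_filterbank(sr, n_fft, n_mels)
    mel_power = power @ fb.T                               # (n_frames, n_mels)
    s_db = 10.0 * np.log10(np.maximum(mel_power, 1e-10))   # floor avoids log(0)
    return (s_db - s_db.max()).T, centers
```

For a pure 1 kHz tone sampled at 22.05 kHz, the strongest mel band is the one whose center frequency lies closest to 1 kHz. The resulting dB matrix would then be resized to the image size used for training.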


Optimization of wavelet selection and parameters for best discrimination of sound classes.
Wavelet selection: the difficulty here is the selection of the right wavelet suited to the full range of noise types: pseudo-stationary, non-stationary, transient/impulsive.
Applying different wavelets to each type of sound significantly improves classification accuracy. We train the CNN with two 128x128 grayscale images per audio clip: scalogram magnitude and phase. Accuracy: ~94%.

Scalograms magnitude (dB)

Scalograms phase (rad)
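
The magnitude and phase scalograms come from a complex continuous wavelet transform. The sketch below implements one with a complex Morlet wavelet in plain NumPy. The choice of Morlet, `omega0=6.0` and the 4-sigma kernel support are assumptions made for illustration; the wavelets actually selected in the notebook are not reproduced here, and a library such as PyWavelets would normally be used.

```python
import numpy as np


def morlet_cwt(x, sr, freqs, omega0=6.0, support=4.0):
    """Complex Morlet CWT of `x`, one row per analysis frequency (Hz)."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty((len(freqs), x.size), dtype=np.complex128)
    for i, f in enumerate(freqs):
        scale = omega0 / (2.0 * np.pi * f)           # scale in seconds
        half = int(np.ceil(support * scale * sr))    # kernel half-width in samples
        if 2 * half + 1 > x.size:
            raise ValueError("signal too short for the lowest frequency")
        u = np.arange(-half, half + 1) / sr / scale  # dimensionless time
        kernel = np.pi ** -0.25 * np.exp(1j * omega0 * u - 0.5 * u ** 2)
        kernel /= np.sqrt(scale) * sr                # 1/sqrt(scale) norm and dt
        # Convolving with psi(t/s) equals correlating with conj(psi),
        # because the Morlet wavelet satisfies psi(-t) = conj(psi(t)).
        out[i] = np.convolve(x, kernel, mode="same")
    return out


def scalogram_images(coeffs, floor_db=-80.0):
    """Return (magnitude in dB relative to the max, phase in radians)."""
    mag = np.abs(coeffs)
    mag_db = 20.0 * np.log10(np.maximum(mag / mag.max(), 10.0 ** (floor_db / 20.0)))
    return mag_db, np.angle(coeffs)
```

For a 50 Hz sine analyzed at 25, 50 and 100 Hz, the magnitude is largest in the 50 Hz row, and the phase row wraps between -π and π once per period. Both outputs have one row per frequency and one column per sample, so the time axis still has to be reduced before the images reach 128x128.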


Combining Mel-Spectrograms (Part I) with Complex Wavelet Transforms (Part II) improves accuracy for sound classes whose features are difficult to discriminate. We train the CNN with three 128x128 grayscale images per audio clip. Accuracy: ~99%.

Rooster: Scalogram Magnitude (dB), Phase (rad) + Mel-spectrogram (dB)
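
One plausible way to present three grayscale images per clip to a CNN is to stack them as channels of a single 128x128x3 input. This is an assumption: the notebook may instead feed them through separate branches. `fuse_channels` is a hypothetical helper, and all three images are assumed to be already resized to the same shape.

```python
import numpy as np


def fuse_channels(magnitude, phase, mel):
    """Stack three equally sized 2-D images into one (H, W, 3) array."""
    images = [np.asarray(a, dtype=np.float32) for a in (magnitude, phase, mel)]
    if any(img.shape != images[0].shape for img in images):
        raise ValueError("all three images must share the same shape")
    # Channel order (assumed): scalogram magnitude, scalogram phase, mel-spectrogram.
    return np.stack(images, axis=-1)
```

Stacking the fused arrays of several clips with `np.stack` then gives a batch of shape `(n_clips, 128, 128, 3)`, the channels-last layout that Keras convolution layers expect by default.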

License

ESC-50: Dataset for Environmental Sound Classification
https://github.com/karoldvl/ESC-50/
https://dx.doi.org/10.7910/DVN/YDEPUT

Dataset license

The dataset as a whole is available under the terms of the Creative Commons Attribution-NonCommercial license (http://creativecommons.org/licenses/by-nc/3.0/).

The ESC-10 subset is licensed as a Creative Commons Attribution 3.0 Unported
(https://creativecommons.org/licenses/by/3.0/) dataset.

Licensing/attribution details for individual audio clips are available in file:

License
