Automatic environmental sound classification (ESC) based on the ESC-50 dataset (and its ESC-10 subset), built by Karol J. Piczak and described in the following article:
Karol J. Piczak. 2015. "ESC: Dataset for Environmental Sound Classification." In Proceedings of the 23rd ACM International Conference on Multimedia (MM '15). Association for Computing Machinery, New York, NY, USA, 1015–1018. https://doi.org/10.1145/2733373.2806390
The ESC-50 dataset is available from Dr. Piczak's GitHub: https://github.com/karoldvl/ESC-50/ The following recent article is a descriptive survey of environmental sound classification (ESC), detailing datasets, preprocessing techniques, features, classifiers, and their accuracy:
Anam Bansal, Naresh Kumar Garg. 2022. "Environmental Sound Classification: A descriptive review of the literature." Intelligent Systems with Applications, Volume 16, 200115, ISSN 2667-3053. https://doi.org/10.1016/j.iswa.2022.200115
Dr. Piczak maintains a table of best results in his GitHub repository, listing authors, publications, and methods used. We reproduce the top of the table here, for supervised classification.
Title | Notes | Accuracy | Paper |
---|---|---|---|
BEATs: Audio Pre-Training with Acoustic Tokenizers | Transformer model pretrained with acoustic tokenizers | 98.10% | chen2022 |
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 |
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 |
AST: Audio Spectrogram Transformer | Pure attention model pretrained on AudioSet | 95.70% | gong2021 |
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | A Transformer model pretrained with visual image supervision | 95.70% | zhao2022 |
We develop our own preprocessing techniques to achieve the best accuracy results, guided by Dr. Piczak's table and by Bansal et al.
At this stage, and before we start working on more advanced techniques:
- we work with the ESC-10 sub-dataset.
- we test mel-spectrograms and wavelet transforms.
We will train a Convolutional Neural Network (CNN) with grayscale spectrograms and scalograms, targeting an accuracy well above 90%.
When tests with the most effective CNN implementation are completed, we will run predictions on various audio clips downloaded from YouTube and, if necessary, update the CNN hyperparameters.
The ESC-10 dataset contains 400 Ogg Vorbis audio clips, each 5 seconds long (sampling frequency 44.1 kHz, 32-bit float), organized into 10 classes with 40 audio clips per class.
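As an illustration, a minimal sketch for loading one clip with librosa; the folder and file names below are placeholders, to be adapted to a local copy of the dataset.

```python
import librosa

# Path is a placeholder: adapt to your local ESC-10 folder layout.
clip_path = "ESC-10/01-Dogbark/1-30226-A.ogg"

# Load at the native 44.1 kHz sampling rate (librosa decodes Ogg Vorbis
# and returns float32 samples in [-1, 1]).
y, sr = librosa.load(clip_path, sr=44100)
print(y.shape, sr)  # ~220500 samples (5 s), 44100 Hz
```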
The 10 sound/noise classes are:
- Class = 01-Dogbark, Label = 0
- Class = 02-Rain, Label = 1
- Class = 03-Seawaves, Label = 2
- Class = 04-Babycry, Label = 3
- Class = 05-Clocktick, Label = 4
- Class = 06-Personsneeze, Label = 5
- Class = 07-Helicopter, Label = 6
- Class = 08-Chainsaw, Label = 7
- Class = 09-Rooster, Label = 8
- Class = 10-Firecrackling, Label = 9
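For reference, the same mapping written as a Python dictionary (folder names as listed above):

```python
# Mapping from ESC-10 class folders to integer labels, as listed above.
CLASS_LABELS = {
    "01-Dogbark": 0,
    "02-Rain": 1,
    "03-Seawaves": 2,
    "04-Babycry": 3,
    "05-Clocktick": 4,
    "06-Personsneeze": 5,
    "07-Helicopter": 6,
    "08-Chainsaw": 7,
    "09-Rooster": 8,
    "10-Firecrackling": 9,
}
```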
Quick analysis of the type of sound/noise:
- Dog bark, baby cry, person sneeze, and rooster involve non-linear vibration and resonance of the vocal (or nasal) tract and cords, a bit like speech, and are considered non-stationary.
- Rain and sea waves are somewhat stationary; rain sounds a bit like white noise. They are pseudo-stationary, because other noises occasionally appear in various audio clips.
- Helicopter, chainsaw: pseudo-stationary. If the engine RPM does not change within a time frame, the process is stationary, with harmonics linked to the engine RPM, the number of cylinders, and the number of rotor blades (helicopter).
- Fire crackling: impulsive noise, over a pseudo-stationary background noise.
- Clock tick: it depends. Impulsive every second (frequency = 1 Hz), but in some audio clips there are several "pulsations" within a one-second time frame. The ticks have the signature of a non-linear mechanical vibration that radiates sound, with harmonics.
In an effort to reduce the size of the problem and the computation time, while retaining relevant information, we (a minimal preprocessing sketch follows this list):
- reduce the audio sampling frequency from 44.1 kHz to 22.05 kHz.
- shorten the audio clips to 1.25 s, based on signal power considerations: many audio clips contain several occurrences of the same sound phenomenon (dog barking or baby crying, for example), and most of the signal is "silence".
- normalize the audio signal amplitude to 1 (0 dBFS).
- compute mel-spectrograms or wavelet transforms for the 10 classes. Wavelet selection and wavelet transform parameters were optimized empirically.
- reduce the size of scalograms in the time domain (some details are lost).
- train a CNN on 256x256 grayscale mel-spectrograms, or on two series of 128x128 grayscale scalograms (magnitude and phase). Train/test split: 80/20%.
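A minimal sketch of this preprocessing chain, assuming librosa for resampling and a sliding-window search for the most energetic 1.25 s segment (the search hop size is an assumption, not a tuned value):

```python
import numpy as np
import librosa

def preprocess_clip(path, target_sr=22050, clip_dur=1.25):
    """Downsample, crop to the highest-power 1.25 s window, peak-normalize."""
    # librosa resamples on load: 44.1 kHz -> 22.05 kHz.
    y, _ = librosa.load(path, sr=target_sr)
    win = int(clip_dur * target_sr)  # 1.25 s -> 27562 samples

    if len(y) <= win:
        y = np.pad(y, (0, win - len(y)))
    else:
        # Slide a window over the clip and keep the one with maximum
        # signal power, so the retained 1.25 s is not "silence".
        hop = win // 8  # search step: an assumption
        starts = np.arange(0, len(y) - win + 1, hop)
        powers = [np.sum(y[s:s + win] ** 2) for s in starts]
        best = starts[int(np.argmax(powers))]
        y = y[best:best + win]

    # Normalize the amplitude to 1 (0 dBFS peak).
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y
```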
We tested three methods:
- Mel-spectrograms.
- Complex Continuous Wavelet Transforms (complex CWT).
- Fusion of mel-spectrograms and complex CWT.
After an 80%/20% train/test split, we train a Convolutional Neural Network with hidden layers of 32, 64, 128, and 256 units. Parameters are detailed in the CNN section of the notebooks.
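A sketch of such a network in Keras, assuming 256x256 single-channel inputs and interpreting the 32-64-128-256 widths as convolutional filter counts; kernel sizes, pooling, dropout, and optimizer are assumptions here, the exact parameters being those detailed in the notebooks:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_shape=(256, 256, 1), n_classes=10):
    """Four convolutional blocks (32-64-128-256 filters) + softmax head."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(256, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dropout(0.5),  # regularization: an assumption
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```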
Note: Although mel-spectrograms and wavelet transforms are shown in color, the CNN is trained with grayscale images.
The best accuracy obtained with the three methods is summarized in the table below.
Method | Accuracy |
---|---|
256x256 Mel-spectrograms | 92.5 % |
128x128 Complex CWT Scalograms Magnitude + Phase | 94 % |
128x128 Fusion Complex CWT + Mel-Spectrograms | 99 % |
Details of the best result with the "Fusion" method:
All Jupyter notebooks share the same structure; they are identical except for the parts implementing the wavelet transforms or the mel-spectrogram transforms.
Part I: Reduction of the audio clip length and optimization of the mel-spectrogram parameters for best discrimination of sound categories. We train the CNN with 256x256 grayscale images. Accuracy: ~92.5%.
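A minimal sketch of the mel-spectrogram step with librosa; n_mels, n_fft, and hop_length are assumptions chosen so that a 1.25 s clip at 22.05 kHz yields a roughly 256x256 image, not the tuned values from the notebooks:

```python
import numpy as np
import librosa

def mel_image(y, sr=22050, n_mels=256, n_fft=2048, hop_length=108, width=256):
    """Log mel-spectrogram rescaled to an 8-bit grayscale image."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)
    S_db = librosa.power_to_db(S, ref=np.max)  # power -> dB scale
    # Min-max rescale to [0, 255] grayscale and crop to a fixed width.
    img = (S_db - S_db.min()) / (S_db.max() - S_db.min() + 1e-9)
    return (img * 255.0).astype(np.uint8)[:, :width]
```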
Part II: Optimization of wavelet selection and parameters for best discrimination of sound classes.
Wavelet selection: the difficulty here is selecting the right wavelet suited to the full range of noise types: pseudo-stationary, non-stationary, and transient/impulsive.
Applying different wavelets to each type of sound significantly improves classification accuracy. We train the CNN with two 128x128 grayscale images per audio clip: scalogram magnitude and phase. Accuracy: ~94%.
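A minimal sketch of the scalogram step using PyWavelets, with a single complex Morlet wavelet and log-spaced scales; both are assumptions, since the project actually selects different wavelets per sound type:

```python
import numpy as np
import pywt

def cwt_images(y, sr=22050, n_scales=128, wavelet="cmor1.5-1.0", width=128):
    """Complex CWT -> two 8-bit grayscale images: magnitude and phase."""
    scales = np.geomspace(2, 512, num=n_scales)  # log-spaced scales (assumption)
    coeffs, _ = pywt.cwt(y, scales, wavelet, sampling_period=1.0 / sr)

    # Reduce the time axis by decimation (some details are lost, as noted above).
    step = max(1, coeffs.shape[1] // width)
    coeffs = coeffs[:, ::step][:, :width]

    to_u8 = lambda a: ((a - a.min()) / (a.max() - a.min() + 1e-9)
                       * 255.0).astype(np.uint8)
    return to_u8(np.abs(coeffs)), to_u8(np.angle(coeffs))
```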
Part III: Combining mel-spectrograms (Part I) with complex wavelet transforms (Part II) enhances accuracy on features that are difficult to discriminate. We train the CNN with three 128x128 grayscale images per audio clip. Accuracy: ~99%.
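A sketch of how the three 128x128 grayscale images could be assembled into one 3-channel CNN input, reusing the hypothetical mel_image and cwt_images helpers from the sketches above:

```python
import numpy as np

def fusion_input(y, sr=22050):
    """Stack mel-spectrogram + CWT magnitude + CWT phase into one tensor."""
    # 128x128 mel variant: hop_length doubled so ~128 frames cover 1.25 s.
    mel = mel_image(y, sr=sr, n_mels=128, hop_length=216, width=128)
    mag, phase = cwt_images(y, sr=sr)

    # Crop all three planes to a common width before stacking.
    w = min(mel.shape[1], mag.shape[1], phase.shape[1])
    x = np.stack([mel[:, :w], mag[:, :w], phase[:, :w]], axis=-1)
    return x.astype(np.float32) / 255.0  # shape (128, w, 3), CNN-ready
```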
ESC-50: Dataset for Environmental Sound Classification
https://github.com/karoldvl/ESC-50/
https://dx.doi.org/10.7910/DVN/YDEPUT
Dataset license
The dataset as a whole is available under the terms of the Creative Commons Attribution-NonCommercial license (http://creativecommons.org/licenses/by-nc/3.0/).
The ESC-10 subset is licensed as a Creative Commons Attribution 3.0 Unported dataset (https://creativecommons.org/licenses/by/3.0/).
Licensing/attribution details for individual audio clips are available in file: