Automatic environmental sound classification (ESC) based on the ESC-50 dataset (and its ESC-10 subset), built by Karol J. Piczak and described in the following article:
Karol J. Piczak. 2015. "ESC: Dataset for Environmental Sound Classification." In Proceedings of the 23rd ACM International Conference on Multimedia (MM '15). Association for Computing Machinery, New York, NY, USA, 1015–1018. https://doi.org/10.1145/2733373.2806390
The ESC-50 dataset is available from Dr. Piczak's GitHub: https://github.com/karoldvl/ESC-50/ The following recent article is a descriptive survey of environmental sound classification (ESC), detailing datasets, preprocessing techniques, features, classifiers, and their accuracy:
Anam Bansal, Naresh Kumar Garg. 2022. "Environmental Sound Classification: A descriptive review of the literature." Intelligent Systems with Applications, Volume 16, 200115, ISSN 2667-3053. https://doi.org/10.1016/j.iswa.2022.200115
Dr. Piczak maintains a table of best results in his GitHub repository, listing authors, publications, and methods used. We reproduce the top of the table here, for supervised classification.
Title | Notes | Accuracy | Paper |
---|---|---|---|
BEATs: Audio Pre-Training with Acoustic Tokenizers | Transformer model pretrained with acoustic tokenizers | 98.10% | chen2022 |
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 |
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 |
AST: Audio Spectrogram Transformer | Pure attention model pretrained on AudioSet | 95.70% | gong2021 |
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | A Transformer model pretrained with visual image supervision | 95.70% | zhao2022 |
We develop our own preprocessing techniques to achieve the best accuracy results, guided by Dr. Piczak's table and by Bansal et al.
At this stage, and before we start working on more advanced techniques:
- we work with the ESC-10 sub-dataset.
- we test mel-spectrograms and wavelet transforms.
We will train a Convolutional Neural Network (CNN) with grayscale spectrograms and scalograms, targeting an accuracy well above 90%.
When tests with the most effective CNN implementation are completed, we will run predictions on various audio clips downloaded from YouTube and, if necessary, update the CNN hyperparameters.
The ESC-10 dataset contains 400 Ogg Vorbis audio clips, each 5 seconds long (sampling frequency 44.1 kHz, 32-bit float), organized into 10 classes with 40 audio clips per class.
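As an illustration, a minimal sketch for loading one clip with librosa; the folder and file names below are placeholders, to be adapted to a local copy of the dataset.

```python
import librosa

# Path is a placeholder: adapt to your local ESC-10 folder layout.
clip_path = "ESC-10/01-Dogbark/1-30226-A.ogg"

# Load at the native 44.1 kHz sampling rate (librosa decodes Ogg Vorbis
# and returns float32 samples in [-1, 1]).
y, sr = librosa.load(clip_path, sr=44100)
print(y.shape, sr)  # ~220500 samples (5 s), 44100 Hz
```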
The 10 sound/noise classes are:
- Class = 01-Dogbark, Label = 0
- Class = 02-Rain, Label = 1
- Class = 03-Seawaves, Label = 2
- Class = 04-Babycry, Label = 3
- Class = 05-Clocktick, Label = 4
- Class = 06-Personsneeze, Label = 5
- Class = 07-Helicopter, Label = 6
- Class = 08-Chainsaw, Label = 7
- Class = 09-Rooster, Label = 8
- Class = 10-Firecrackling, Label = 9
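For reference, the same mapping written as a Python dictionary (folder names as listed above):

```python
# Mapping from ESC-10 class folders to integer labels, as listed above.
CLASS_LABELS = {
    "01-Dogbark": 0,
    "02-Rain": 1,
    "03-Seawaves": 2,
    "04-Babycry": 3,
    "05-Clocktick": 4,
    "06-Personsneeze": 5,
    "07-Helicopter": 6,
    "08-Chainsaw": 7,
    "09-Rooster": 8,
    "10-Firecrackling": 9,
}
```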
Quick analysis of the type of sound/noise:
- Dog bark, baby cry, person sneeze, and rooster involve non-linear vibration and resonance of the vocal (or nasal) tract and cords, a bit like speech, and are considered non-stationary.
- Rain and sea waves are somewhat stationary; rain sounds a bit like white noise. They are pseudo-stationary, because other noises occasionally appear in various audio clips.
- Helicopter, chainsaw: pseudo-stationary. If the engine RPM does not change within a time frame, the process is stationary, with harmonics linked to the engine RPM, the number of cylinders, and the number of rotor blades (helicopter).
- Fire crackling: impulsive noise, over a pseudo-stationary background noise.
- Clock tick: it depends. Impulsive every second (frequency = 1 Hz), but in some audio clips there are several "pulsations" within a one-second time frame. The ticks have the signature of a non-linear mechanical vibration that radiates sound, with harmonics.
In an effort to reduce the size of the problem and the computation time, while retaining relevant information, we (a minimal preprocessing sketch follows this list):
- reduce the audio sampling frequency from 44.1 kHz to 22.05 kHz.
- shorten the audio clips to 1.25 s, based on signal power considerations: many audio clips contain several occurrences of the same sound phenomenon (dog barking or baby crying, for example), and most of the signal is "silence".
- normalize the audio signal amplitude to 1 (0 dBFS).
- compute mel-spectrograms or wavelet transforms for the 10 classes. Wavelet selection and wavelet transform parameters were optimized empirically.
- reduce the size of scalograms in the time domain (some details are lost).
- train a CNN on 256x256 grayscale mel-spectrograms, or on two series of 128x128 grayscale scalograms (magnitude and phase). Train/test split: 80/20%.
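A minimal sketch of this preprocessing chain, assuming librosa for resampling and a sliding-window search for the most energetic 1.25 s segment (the search hop size is an assumption, not a tuned value):

```python
import numpy as np
import librosa

def preprocess_clip(path, target_sr=22050, clip_dur=1.25):
    """Downsample, crop to the highest-power 1.25 s window, peak-normalize."""
    # librosa resamples on load: 44.1 kHz -> 22.05 kHz.
    y, _ = librosa.load(path, sr=target_sr)
    win = int(clip_dur * target_sr)  # 1.25 s -> 27562 samples

    if len(y) <= win:
        y = np.pad(y, (0, win - len(y)))
    else:
        # Slide a window over the clip and keep the one with maximum
        # signal power, so the retained 1.25 s is not "silence".
        hop = win // 8  # search step: an assumption
        starts = np.arange(0, len(y) - win + 1, hop)
        powers = [np.sum(y[s:s + win] ** 2) for s in starts]
        best = starts[int(np.argmax(powers))]
        y = y[best:best + win]

    # Normalize the amplitude to 1 (0 dBFS peak).
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y
```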
We tested three methods:
- Mel-spectrograms.
- Complex Continuous Wavelet Transforms (complex CWT).
- Fusion of mel-spectrograms and complex CWT.
After an 80%/20% train/test split, we train a Convolutional Neural Network with hidden layers of 32, 64, 128, and 256 units. Parameters are detailed in the CNN section of the notebooks.
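A sketch of such a network in Keras, assuming 256x256 single-channel inputs and interpreting the 32-64-128-256 widths as convolutional filter counts; kernel sizes, pooling, dropout, and optimizer are assumptions here, the exact parameters being those detailed in the notebooks:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_shape=(256, 256, 1), n_classes=10):
    """Four convolutional blocks (32-64-128-256 filters) + softmax head."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Conv2D(256, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dropout(0.5),  # regularization: an assumption
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```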
Note: Although mel-spectrograms and wavelet transforms are shown in color, the CNN is trained with grayscale images.
The best accuracy obtained with the three methods is summarized in the table below.
Method | Accuracy |
---|---|
256x256 Mel-spectrograms | 92.5 % |
128x128 Complex CWT Scalograms Magnitude + Phase | 94 % |
128x128 Fusion Complex CWT + Mel-Spectrograms | 99 % |
Details of the best result with the "Fusion" method:
All Jupyter notebooks share the same structure; they are identical except for the parts implementing the wavelet transforms or the mel-spectrogram transforms.
Part I: Reduction of the audio clip length and optimization of the mel-spectrogram parameters for best discrimination of sound categories. We train the CNN with 256x256 grayscale images. Accuracy: ~92.5%.
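A minimal sketch of the mel-spectrogram step with librosa; n_mels, n_fft, and hop_length are assumptions chosen so that a 1.25 s clip at 22.05 kHz yields a roughly 256x256 image, not the tuned values from the notebooks:

```python
import numpy as np
import librosa

def mel_image(y, sr=22050, n_mels=256, n_fft=2048, hop_length=108, width=256):
    """Log mel-spectrogram rescaled to an 8-bit grayscale image."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)
    S_db = librosa.power_to_db(S, ref=np.max)  # power -> dB scale
    # Min-max rescale to [0, 255] grayscale and crop to a fixed width.
    img = (S_db - S_db.min()) / (S_db.max() - S_db.min() + 1e-9)
    return (img * 255.0).astype(np.uint8)[:, :width]
```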
Part II: Optimization of wavelet selection and parameters for best discrimination of sound classes.
Wavelet selection: the difficulty here is selecting the right wavelet suited to the full range of noise types: pseudo-stationary, non-stationary, and transient/impulsive.
Applying different wavelets to each type of sound significantly improves classification accuracy. We train the CNN with two 128x128 grayscale images per audio clip: scalogram magnitude and phase. Accuracy: ~94%.
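A minimal sketch of the scalogram step using PyWavelets, with a single complex Morlet wavelet and log-spaced scales; both are assumptions, since the project actually selects different wavelets per sound type:

```python
import numpy as np
import pywt

def cwt_images(y, sr=22050, n_scales=128, wavelet="cmor1.5-1.0", width=128):
    """Complex CWT -> two 8-bit grayscale images: magnitude and phase."""
    scales = np.geomspace(2, 512, num=n_scales)  # log-spaced scales (assumption)
    coeffs, _ = pywt.cwt(y, scales, wavelet, sampling_period=1.0 / sr)

    # Reduce the time axis by decimation (some details are lost, as noted above).
    step = max(1, coeffs.shape[1] // width)
    coeffs = coeffs[:, ::step][:, :width]

    to_u8 = lambda a: ((a - a.min()) / (a.max() - a.min() + 1e-9)
                       * 255.0).astype(np.uint8)
    return to_u8(np.abs(coeffs)), to_u8(np.angle(coeffs))
```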
Part III: Combining mel-spectrograms (Part I) with complex wavelet transforms (Part II) enhances accuracy on features that are difficult to discriminate. We train the CNN with three 128x128 grayscale images per audio clip. Accuracy: ~99%.
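A sketch of how the three 128x128 grayscale images could be assembled into one 3-channel CNN input, reusing the hypothetical mel_image and cwt_images helpers from the sketches above:

```python
import numpy as np

def fusion_input(y, sr=22050):
    """Stack mel-spectrogram + CWT magnitude + CWT phase into one tensor."""
    # 128x128 mel variant: hop_length doubled so ~128 frames cover 1.25 s.
    mel = mel_image(y, sr=sr, n_mels=128, hop_length=216, width=128)
    mag, phase = cwt_images(y, sr=sr)

    # Crop all three planes to a common width before stacking.
    w = min(mel.shape[1], mag.shape[1], phase.shape[1])
    x = np.stack([mel[:, :w], mag[:, :w], phase[:, :w]], axis=-1)
    return x.astype(np.float32) / 255.0  # shape (128, w, 3), CNN-ready
```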
ESC-50: Dataset for Environmental Sound Classification
https://github.com/karoldvl/ESC-50/
https://dx.doi.org/10.7910/DVN/YDEPUT
Dataset license
The dataset as a whole is available under the terms of the Creative Commons Attribution-NonCommercial license (http://creativecommons.org/licenses/by-nc/3.0/).
The ESC-10 subset is licensed as a Creative Commons Attribution 3.0 Unported dataset (https://creativecommons.org/licenses/by/3.0/).
Licensing/attribution details for individual audio clips are available in file: