
Randomly weighted CNNs for (music) audio classification

This study builds on prior work showing that the classification performance delivered by random CNN features is correlated with the results of their end-to-end trained counterparts [1]. In other words: architectures that perform well with random weights also tend to perform well with trained weights. We use this property to run a fast and comprehensive evaluation of current deep architectures for (music) audio. Our method works as follows: first, we extract a feature vector from the embeddings of a randomly weighted CNN; then, we feed these features to a classifier – which can be a support vector machine (SVM) or an extreme learning machine (ELM). Our goal is to compare the classification accuracies obtained with different CNN architectures.
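
For intuition, here is a minimal sketch of this pipeline (the toy architecture, shapes and names below are illustrative assumptions, not the repo's exact code):

import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

# Toy CNN whose weights are randomly initialized and never trained.
random_cnn = tf.keras.Sequential([
    tf.keras.layers.Conv1D(64, 3, activation='relu', input_shape=(16000, 1)),
    tf.keras.layers.MaxPooling1D(4),
    tf.keras.layers.Conv1D(64, 3, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),  # fixed-length feature vector
])

waveforms = np.random.randn(20, 16000, 1).astype('float32')  # fake audio batch
labels = np.random.randint(0, 2, size=20)                    # fake labels

features = random_cnn.predict(waveforms)  # embeddings from random weights
clf = SVC().fit(features, labels)         # classifier trained on those features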

We consider three datasets in our study. The first is the fault-filtered GTZAN dataset for music genre classification – classification accuracy is the figure of merit on the vertical axis of the following figures:

The second dataset is the Extended Ballroom (meant for learning to discriminate rhythm/tempo music classes):

And the last dataset is UrbanSound8K, composed of (non-music) sounds:

Observations

  • The results we obtain are far from random, since: 1) randomly weighted CNNs come close (in some cases) to matching the accuracies obtained by trained CNNs – see the paper; and 2) they are able to outperform MFCCs.

  • Extended Ballroom dataset experiments: (musical) priors embedded in the structure of the model facilitate capturing useful (temporal) cues for classifying rhythm/tempo classes – see the accuracy of the (non-trained) Temporal architecture (89.82%), which is very close to that of a state-of-the-art trained CNN (93.7%).

  • Waveform front-ends: sample-level >> frame-level many-shapes > frame-level – as noted in the (trained) literature.

  • Spectrogram front-ends: 7x96 < 7x86 – as shown in prior (trained) works.

Dependencies

You need to install the following dependencies: tensorflow, librosa, pandas, numpy, scipy, sklearn and pickle. These models run reasonably on a CPU, so we recommend installing the CPU version of tensorflow.
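
Assuming a pip-based setup, installing the dependencies could look as follows – note that sklearn is distributed as the scikit-learn package, and pickle is part of the Python standard library:

pip install tensorflow librosa pandas numpy scipy scikit-learn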

The public extreme learning machine implementation we use (already included in this repo) can be found here.
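
As a rough intuition of what an ELM does (a generic sketch, not the implementation included in this repo): a single hidden layer with random, untrained weights, followed by a closed-form ridge-regression readout.

import numpy as np

def elm_fit(X, y_onehot, n_hidden=512, reg=1e-3, seed=0):
    # Random, untrained hidden layer followed by a ridge-regression readout.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random input weights
    H = np.tanh(X @ W)                               # hidden activations
    # Closed-form output weights: (H^T H + reg*I)^-1 H^T Y
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y_onehot)
    return W, beta

def elm_predict(X, W, beta):
    return np.argmax(np.tanh(X @ W) @ beta, axis=1)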

Usage

Edit src/config_file.py and run: python main.py

Some documentation is available in config_file.py, but here are some examples of how to set the configuration file:

config_main = {

    # Experimental setup
    'experiment_name': 'v0_or_any_name',
    'features_type': 'CNN',
    'model_type': 'SVM',
    'SVM_verbose': 1,
    'load_extracted_features': False,
    'sampling_rate': 12000,

    # Dataset configuration
    'dataset': 'GTZAN',
    'audio_path': '/path_to_audio/jpons/GTZAN/',
    'save_extracted_features_folder': '../data/GTZAN/features/',
    'results_folder': '../data/GTZAN/results/',
    'train_set_list': '/path_to_train_set/jpons/GTZAN_partitions/train_filtered.txt',
    'val_set_list': '/path_to_val_set/jpons/GTZAN_partitions/valid_filtered.txt',
    'test_set_list': '/path_to_test_set/jpons/GTZAN_partitions/test_filtered.txt',
    'audios_list': False,

    # Waveform model: sample level CNN
    'CNN': {
        'signal': 'waveform',
        'n_samples': 350000,

        'architecture': 'sample_level',
        'num_filters': 512,
        'selected_features_list': [0, 1, 2, 3, 4, 5, 6],

        'batch_size': 5
    }
}

As a result of this config file, the input waveforms of the GTZAN dataset are formatted to be ≈ 29 s long (350,000 samples at 12 kHz), features are computed in batches of 5, and an SVM classifier is used.

This experiment runs the sample_level CNN architecture with 512 filters in every layer, and uses every feature map in every layer to compute the feature vector – see the implementation of the sample_level model in src/dl_models.py.
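
A sketch of how 'selected_features_list' could translate into a feature vector – the exact pooling used in src/dl_models.py may differ, so treat the summarization below (mean and max over time) as an assumption:

import numpy as np

def build_feature_vector(layer_outputs, selected_features_list):
    # layer_outputs: list of arrays, one per CNN layer, shaped (time, num_filters).
    # For each selected layer, summarize every feature map over time and
    # concatenate the summaries into one fixed-length vector.
    pooled = []
    for i in selected_features_list:
        fmap = layer_outputs[i]
        pooled.append(fmap.mean(axis=0))  # average over time
        pooled.append(fmap.max(axis=0))   # max over time
    return np.concatenate(pooled)

# Example: a 7-layer sample_level CNN with 512 filters per layer.
layer_outputs = [np.random.randn(100, 512) for _ in range(7)]
vector = build_feature_vector(layer_outputs, [0, 1, 2, 3, 4, 5, 6])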

config_main = {

    # Experimental setup
    'experiment_name': 'v0_or_any_name',
    'features_type': 'CNN',
    'model_type': 'ELM',
    'load_extracted_features': False,
    'sampling_rate': 12000,

    # Dataset configuration
    'dataset': 'ExtendedBallroom',
    'audio_path': '/path_to_audio/jpons/extended_ballroom/',
    'save_extracted_features_folder': '../data/Extended_Ballroom/features/',
    'results_folder': '../data/Extended_Ballroom/results/',
    'train_set_list': None,
    'val_set_list': None,
    'test_set_list': None,
    'audios_list': '/path_to_all_audios/jpons/extended_ballroom/all_files.txt',

    # Spectrogram model: 7x86 CNN
    'CNN': {
        'signal': 'spectrogram',
        'n_mels': 96,
        'n_frames': 1376,

        'architecture': 'cnn_single',
        'num_filters': 3585,
        'selected_features_list': [1],
        'filter_shape': [7,86],
        'pool_shape': [1,11],

        'batch_size': 5        
    }
}

As a result of this config file, the input spectrograms of the Extended Ballroom dataset are formatted to be ≈ 29 s long (1376 frames at 12 kHz), features are computed in batches of 5, and an ELM classifier is used.

This experiment runs the 7x86 CNN architecture with 3585 filters, and uses every feature map (of this single-layer CNN) to compute the feature vector – see the implementation of the 7x86 model in src/dl_models.py.
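
A toy version of this single-layer spectrogram CNN, with random (untrained) weights – 32 filters are used here instead of 3585 just to keep the sketch light, and the global pooling at the end is an assumption:

import numpy as np
import tensorflow as tf

cnn_7x86 = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (7, 86), activation='relu',
                           input_shape=(96, 1376, 1)),   # n_mels x n_frames
    tf.keras.layers.MaxPooling2D(pool_size=(1, 11)),     # 'pool_shape': [1, 11]
    tf.keras.layers.GlobalMaxPooling2D(),                # one value per feature map
])

spectrograms = np.random.randn(4, 96, 1376, 1).astype('float32')
features = cnn_7x86.predict(spectrograms)  # shape: (4, 32)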

Reproducing our results

To reproduce our results, one needs to download the data and use the same partitions:

  • GTZAN fault-filtered version: download the data (link). Download the (.txt) files that list which audio files belong to each partition (link). Set the config file: 'train_set_list': 'path/train_audios.txt', 'val_set_list': 'path/val_audios.txt', 'test_set_list': 'path/test_audios.txt', 'audios_list': False.

  • Extended Ballroom: download the data (link). 10 stratified folds are randomly generated (via a sklearn function) for cross-validation – see the sketch after this list. List all the audio files in a file, and set the config file: 'train_set_list': None, 'val_set_list': None, 'test_set_list': None, 'audios_list': 'path/all_audios.txt'.

  • UrbanSound8K: download the data (link). Partitions are already defined by the dataset authors, and we include code to read them. Just list all the audio files in a file, and set the config file: 'train_set_list': None, 'val_set_list': None, 'test_set_list': None, 'audios_list': 'path/all_audios.txt'.
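
For the Extended Ballroom folds, the sklearn call involved is of this kind (a sketch under assumed names – the repo's exact fold-generation code may differ):

from sklearn.model_selection import StratifiedKFold

# Hypothetical file list and labels; in practice these come from
# 'audios_list' and the dataset's genre annotations.
files = ['song_%03d.mp3' % i for i in range(100)]
genres = [i % 10 for i in range(100)]  # 10 rhythm/tempo classes

# 10 stratified folds, preserving the class balance in every fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(files, genres)):
    train_files = [files[i] for i in train_idx]
    test_files = [files[i] for i in test_idx]
    print(fold, len(train_files), len(test_files))
    # ... extract random-CNN features and fit the classifier per fold ...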

For more information, see the documentation available in config_file.py.

References

[1] Saxe et al. "On Random Weights and Unsupervised Feature Learning". In: ICML, 2011, pp. 1089–1096.
