
transformer-based-ser's Introduction

Transformer-based model for Speech Emotion Recognition (SER), implemented in PyTorch

Overview:

Two classes of features can be used to recognize emotion from speech: lexical features (the vocabulary used) and acoustic features (sound properties). The two can also be combined. However, lexical or linguistic features require a transcript of the speech, which means an additional speech recognition step to extract the text. This project therefore uses acoustic features only.

Further, there are two approaches to representing emotions:

  • Dimensional Representation: Representing emotions with dimensions such as Valence (on a negative to positive scale), Activation or Energy (on a low to high scale), and Dominance (on a submissive to dominant scale).
  • Discrete Classification: Classifying emotions in discrete labels like anger, happiness, etc.

Dimensional Representation is more elaborate and gives more information. However, due to the lack of annotated audio data in a dimensional format, we used a discrete classification approach in this project.

Model

The model comprises two main parts. The first is HuBERT, a pre-trained transformer-based speech model that extracts features (embedding vectors); it accepts a float array corresponding to the raw waveform of the speech signal. The second is a classifier head that takes the HuBERT output and consists of two linear layers and a tanh activation function.
Note that HuBERT is loaded through the AutoModel class (from Hugging Face), so by changing only the model_checkpoint variable (in config.py) you can use other architectures such as Wav2Vec 2.0 and WavLM (for more information, read this Hugging Face document).
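The two-part structure described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the class names ClassifierHead and EmotionClassifier, the helper load_backbone, the mean pooling over time, and the default checkpoint name "facebook/hubert-base-ls960" are all assumptions made for the example. Only the use of AutoModel, the model_checkpoint variable, and the "two linear layers plus tanh" head come from the description.

```python
import torch
from torch import nn


class ClassifierHead(nn.Module):
    """Hypothetical head: linear -> tanh -> linear, applied to pooled features."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden_size) from the speech encoder.
        # Assumption: average over the time axis to get one vector per utterance.
        pooled = hidden_states.mean(dim=1)
        return self.out_proj(torch.tanh(self.dense(pooled)))


def load_backbone(model_checkpoint: str = "facebook/hubert-base-ls960"):
    """Load a pre-trained speech encoder; the default checkpoint is an assumption."""
    # Imported lazily so the head can be used without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(model_checkpoint)


class EmotionClassifier(nn.Module):
    """Hypothetical wrapper: encoder features followed by the classifier head."""

    def __init__(self, backbone: nn.Module, num_labels: int):
        super().__init__()
        self.backbone = backbone
        self.head = ClassifierHead(backbone.config.hidden_size, num_labels)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # input_values: (batch, samples) raw waveform as floats.
        hidden = self.backbone(input_values).last_hidden_state
        return self.head(hidden)
```

Under these assumptions, EmotionClassifier(load_backbone(), num_labels) would map a batch of raw waveforms to one logit per emotion class, and passing a different checkpoint string to load_backbone would swap the encoder without touching the head.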

Dataset

We used ShEMO (Sharif Emotional Speech Database) to train and evaluate the model. The database includes 3000 semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data extracted from online radio plays. As you can see in the bar chart, the dataset is very imbalanced, which makes classification harder, especially for minority classes. We therefore used data augmentation methods to improve the model's performance and accuracy.
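The description does not say which augmentation methods were applied, so the following is only a sketch of two common waveform-level augmentations (additive Gaussian noise at a target signal-to-noise ratio, and a random gain change) that could be used to create extra samples for minority classes. The function names add_noise and random_gain and all parameter values are hypothetical.

```python
import numpy as np


def add_noise(waveform, snr_db, rng=None):
    """Add white Gaussian noise so the result has roughly the given SNR in dB."""
    if rng is None:
        rng = np.random.default_rng()
    waveform = np.asarray(waveform, dtype=np.float32)
    signal_power = np.mean(waveform ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return (waveform + noise).astype(np.float32)


def random_gain(waveform, min_db=-6.0, max_db=6.0, rng=None):
    """Scale the waveform by a gain drawn uniformly (in dB) from [min_db, max_db]."""
    if rng is None:
        rng = np.random.default_rng()
    waveform = np.asarray(waveform, dtype=np.float32)
    gain_db = rng.uniform(min_db, max_db)
    # Convert the dB gain to a linear amplitude factor.
    return (waveform * 10 ** (gain_db / 20)).astype(np.float32)
```

Applying such transforms more often to utterances from under-represented classes is one way to rebalance the training set; whether this matches the project's actual augmentation pipeline is not stated in the description.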

Contributors

hoseinazad
