Code Monkey home page Code Monkey logo

lyricsgenreclassification's Introduction

LyricsGenreClassification

In this project, we try to train a neural network to discover the genre of a music track only by looking at its lyrics text.

Data set

Regarding the datasets used in this project, we found in two papers a link pointing to a compilation of "55000+ song-lyrics in English from LyricsFreak", made by user Kuznetsov. Unfortunately, the songs in the dataset were not labeled with its corresponding musical genres, so we had to fetch ourselves the labels from an external site, namely the last.fm website, which provided us with the tags we were looking for. Ultimately, we decided to discard too general genres like pop, rock etc., and focus only on genres that are a priori much easier to distiguish, i.e. metal, hip-hop, country, religious and jazz. After this filtering process, the original 55.000 dataset shrinked to less than 10.000. Because of this and because we thought a bigger dataset not used in the above mentioned papers before would be ideal, we then searched for another corpus in kaggle and found one named "Every song you have heard (almost). "Over 500,000 song lyric urls for over a million artists" of which just the first part/half of 250.000 was considered, as it was very time-consuming to preprocess all of the data. Once again the songs were not labelled, so this had to be done as well thanks to the api service of last.fm and after filtering the data to obtain just the songs with the genres of interest for our use case, this dataset shrinked to about 30.000 that in the end also showed better results in the project implementation.

Links to the two datasets:

Preprocessing

The preprocessing consists of several steps in order to convert the plaintext of the song-lyrics into a format that is processable by a neural network.

1. Read data and filter for relevant data

The .csv-file of the dataset contains around 250.000 songs. We only want to work with english songs. Therefore, we detect the languages of the lyircs using the python package cld2 and remove all non-english songs from the dataset. After filtering out genres which are not interesting for us, the dataset has a size of 30.138 songs. The genres we are interested in are listed below with their respective quantities of songs.

Genre Number samples
Metal 9.147
Country 7.852
Jazz 5.860
Hip-Hop 5.420
Religious 1.859

The filtered data set containing the genres can be found here.

2. Expand contractions

In this step, all the lyrics texts are taken and the contractions are expanded.

3. Tokenization

The lyrics, which are still encoded as normal strings, are now tokenized ignoring the sentence- or newline-structure. The output of tokenization is a simple list of words.

4. POS tagging and lemmatization

In this step, the tokenized lyrics texts are taken, first POS tagged and then lemmatized. The POS tagging converts the list of words from above to pairs of (word, POS tag). After the POS tagging, lemmatization is applied and the text is transferred back to a normal list of words representation.

5. Filter stopwords

The last step is to filter out the stopwords from the lyrics texts as they don't contain any valuable information we need. Removing the stopwords makes the texts quite shorter than before

In the end, the preprocessed data set is written to a .npy-file, so that it can be easily used for later steps by simply loading it from the file, without having to preprocess the data again. The data set is written to the file as a numpy array, where the first column contains the genre and the second column contains the list of words representing the lyrics of the song.

Feature Extraction

In the feature extraction, the dataset from the preprocessing is read and numerical features are extracted.
A TF-IDF vectorizer object is created from the lyrics data as well as a word2vec model. Together, the two models are used to generate a numerical vector representation of each lyric-text, using 75 dimensions.

Those vectors are again written to a .npy-file and can then be used by a neural network.

Model training

We tried two different approaches for training our neural network for lyrics-genre classification:

  1. use an LSTM network directly on the preprocessed text-data (implemented in ClassificationLSTM.py)
  2. apply feature extraction to the preprocessed text-data and feed the numerical vectors into a fully connected neural network (implemented in Classification.py)

After several tests and modifications, we came to the conclusion that the second mentioned approach yields a better accuracy and is also way faster to train.

The optimal parameters for the fully connected neural network were:

  • Learning rate: 0.01
  • Batch size: 256
  • Number of steps: 20.000
  • 2 layers of 32 and 64 neurons

The model achieved an average test-accuracy of 74% using 5-Fold cross-validation.

Prediction

In the end, one should be able to make actual predictions with the created model. Therefore we stored the final model in a file in order to allow to load it at a later point in time. I.e. the weights and biases are stored to later make a prediction within the Prediction.py file. In this file, a lyrics text can be defined. This text is then preprocessed, converted to numerical vectors and can be fed into the neural network.
Due to time-reasons, we could not finish the very last step of loading the stored model and doing an actual prediction for the defined lyrics-string.

lyricsgenreclassification's People

Contributors

mrrocketraccoon avatar wiefel avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.