Song-Popularity-Prediction

This repository contains the code for my Bachelor's thesis, "Prediction of song popularity with a textual and audio based approach", available for download at http://tesi.luiss.it/31943/ (alternatively, see Thesis.pdf).

TL;DR: This thesis proposes a blended approach to predicting song popularity, combining some of the textual approaches presented in Berger et al. (2018; 2020) with the audio-based one suggested in Lee et al. (2018).

For a more extensive description refer to the thesis document.

Scope

The main objective is to verify the propositions presented in Berger et al. (2018; 2020) by testing them for statistical significance, and to be able to generalize their approach to any new data. Hence, I will reproduce their analysis from scratch (starting from data collection). Then, I will extend the analysis by including some of the audio-based features described in Lee et al. (2018).

Data

The original dataset from Berger et al. contains data about 4200 songs and was scraped from Billboard.com (see You_S1_Data_NoBillboardRanking.csv at https://osf.io/cbguq). Specifically, the authors scraped the weekly chart of the most downloaded songs (top 50) once per quarter for 3 years, for 7 major genres (i.e., christian, country, dance, rock, pop, rap, r&b). This sampling scheme accounts for the dataset size: 50 chart positions × 4 quarters × 3 years × 7 genres = 4200 chart entries.

To acquire the Billboard data, refer to BillboardScraper.py, which returns the following columns: song name, artist name, genre, date and rank. Note, however, that as of now (2023), and unlike a few years ago, the Digital Song Sales charts are only accessible with a paid monthly subscription (i.e., Billboard Pro). The song lyrics were then added using AddLyrics.py.

The audio tracks, needed to extend the analysis, were downloaded in mp3 format from Youtube.com (see YoutubeToMp3.py).

Feature Engineering

Most of the variables had to be engineered from the raw data.

For the Textual Analysis, after some preprocessing (which includes Case Normalization, Tokenization, Stop-word Removal and Lemmatization), the following features were built:

  • The top 100 words appearing across all songs (excl. second-person pronouns) computed using Term Frequency (TF).
  • The number of times a song appeared on the charts.
  • The number of different genres a song belongs to.
  • A boolean value indicating whether a song appeared on Billboard's radio airplay chart in that period (again, as of 2023, only available with Billboard Pro).
  • Second-person pronouns and other word metrics (e.g., cognitive words, affect words, etc.) extracted using Linguistic Inquiry and Word Count (LIWC) 2015.
  • 10 different topics extracted by performing Latent Dirichlet Allocation (LDA) on lyrics (see LDA.r).
  • A custom version of the Linguistic Style Matching (LSM) equation, adapted to topic composition so that it measures difference rather than similarity, and computed from each song's topic composition and the average topic composition of its genre.
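The last feature in the list can be illustrated with a minimal sketch. It assumes the adaptation simply drops the "1 -" from the standard LSM formula, 1 - |a - b| / (a + b + 0.0001), and averages the resulting per-topic differences; the exact equation used in the thesis may differ. The function names, the toy data, and the choice to include each song in its own genre average are hypothetical.

```python
from collections import defaultdict

# Small constant from the standard LSM formula; it avoids division by zero.
SMOOTHING = 0.0001


def genre_means(songs):
    """Average topic composition per genre.

    `songs` is a list of (genre, topic_proportions) pairs (hypothetical layout).
    """
    sums = {}
    counts = defaultdict(int)
    for genre, topics in songs:
        if genre not in sums:
            sums[genre] = [0.0] * len(topics)
        sums[genre] = [s + t for s, t in zip(sums[genre], topics)]
        counts[genre] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


def topic_difference(song_topics, genre_topics):
    """Mean per-topic difference between a song and its genre average.

    Assumption: this is the LSM formula without the leading "1 -", so higher
    values mean the song is less typical of its genre.
    """
    per_topic = [
        abs(s - g) / (s + g + SMOOTHING)
        for s, g in zip(song_topics, genre_topics)
    ]
    return sum(per_topic) / len(per_topic)


# Toy topic compositions with 3 topics instead of the 10 used in the thesis.
songs = [
    ("pop", [0.8, 0.1, 0.1]),
    ("pop", [0.6, 0.3, 0.1]),
    ("rock", [0.2, 0.2, 0.6]),
]
means = genre_means(songs)
scores = [topic_difference(topics, means[genre]) for genre, topics in songs]
print(scores)
```

In this toy data the only rock song coincides with its genre average, so its score is zero, while both pop songs deviate from the pop average and get positive scores.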

For the Audio Analysis, the following features were built (by far the most technical part, which required some signal processing skills):

  • The Structural Change of two Complexity Features (i.e., Chroma and Timbre) as described in Lee et al.
  • Arousal (ibid.).
  • MFCC (ibid.).
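To give an idea of what a structural change feature involves, here is a rough sketch. It assumes structural change at a given frame is the Jensen-Shannon divergence between the averaged, normalised feature vectors in a window before and a window after that frame, and that the features are non-negative (as chroma is). The window size, the normalisation, and all function names are illustrative assumptions; the definition actually used is the one in Lee et al. and in the thesis.

```python
import math


def _mean_vector(frames):
    """Element-wise mean of a list of equally sized feature vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def _normalise(vec):
    """Scale a non-negative vector so that it sums to 1."""
    total = sum(vec)
    if total == 0:
        return [1.0 / len(vec)] * len(vec)
    return [v / total for v in vec]


def _kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions (0 to 1 in bits)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def structural_change(frames, window):
    """Divergence between the `window` frames before and after each position."""
    changes = []
    for i in range(window, len(frames) - window + 1):
        left = _normalise(_mean_vector(frames[i - window:i]))
        right = _normalise(_mean_vector(frames[i:i + window]))
        changes.append(js_divergence(left, right))
    return changes


# Toy "chroma" with 2 bins: four frames on one pitch class, then four on another.
frames = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
changes = structural_change(frames, window=2)
print(changes)
```

The resulting curve is flat where the toy signal is stable and peaks at the boundary between the two halves; a per-song feature could then be a summary (for example, the mean) of this curve, possibly over several window sizes.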

Research Questions

  1. Do the propositions presented in Berger and Packard (2018; 2020), as well as those in Lee and Lee (2018), hold when tested for statistical significance?

  2. How can the best performance be achieved on the song popularity prediction task? Which variables should the best model include?

Analysis

  1. The statistical significance of the engineered variables for the textual analysis was checked using Ordinary Least Squares (OLS) linear regression, reproducing 9 models described in Berger et al. (see also section 4.3 of my thesis). For the audio analysis, 4 models from the Lee et al. paper were reproduced.

  2. Three Random Forest and three Support Vector Machine models were built to predict a song's rank; their hyperparameters were tuned using GridSearchCV, and the best-performing model was selected.
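The tuning and selection step in point 2 can be sketched roughly as follows. This is not the notebook's code: it assumes rank prediction is framed as regression, uses synthetic data in place of the engineered features, tunes only one model per family, and picks parameter grids, the number of folds and the scoring metric purely for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for the engineered features (X) and the chart rank (y).
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

# One candidate per model family, each with an illustrative hyperparameter grid.
candidates = {
    "random_forest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [25, 50], "max_depth": [3, None]},
    ),
    "svm": (
        SVR(),
        {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]},
    ),
}

results = {}
for name, (estimator, param_grid) in candidates.items():
    # Exhaustive search over the grid with 3-fold cross-validation.
    search = GridSearchCV(
        estimator, param_grid, cv=3, scoring="neg_mean_squared_error"
    )
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

# Higher (less negative) score means lower cross-validated error.
best_name = max(results, key=lambda name: results[name][0])
print(best_name, results[best_name])
```

Comparing models on their cross-validated score rather than on training error keeps the selection from favouring whichever model overfits most.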

For the actual implementation please refer to Thesis_Notebook.ipynb.

Results

Everything is explained thoroughly in Sections 5 and 6 of Thesis.pdf.

(A Few) References

  • Berger, Jonah and Grant Packard (2020), “Thinking of You: How Second-Person Pronouns Shape Cultural Success”
  • Berger, Jonah and Grant Packard (2018), “Are Atypical Things More Popular?”
  • Lee, Jong-Seok and Junghyuk Lee (2018), “Music Popularity: Metrics, Characteristics, and Audio-Based Prediction”
