
Summary Extraction from Spotify Podcasts

Link to project tutorial and overview video.

Description

This repository contains the code and reports for our CS410 Fa21 course project. This document describes the usage of all scripts and the functions defined in them. For more information about the project tasks, please see the proposal and progress report.

File Descriptions

| File Name | Task Summary |
| --- | --- |
| extract_data.py | Extracts episode transcripts and creator descriptions from the raw dataset and processes them into combined CSVs, using the episode prefix as a unique ID |
| content_selection.py | Reduces episode transcripts to the top 5 most important sentences according to the TextRank algorithm |
| t5_training.py | Script for fine-tuning the T5 transformer model on the preprocessed podcast dataset |
| walkthrough.ipynb | Tutorial notebook for code illustration and evaluation of the three methods |

Each of the above files is accompanied by a corresponding .ipynb notebook as well. CAs/TAs are requested to use those notebooks for evaluation; they can be imported into a local Jupyter instance or Google Colab.

Function Descriptions

extract_episode_lists_and_transcripts

Source: extract_data.py

Inputs: data_id (int). Each of the raw data gzip files decompresses into spotify-podcasts-2020/podcasts-transcripts/<data_id>.

Output: tuple of two dicts - show_episodes_dict (Schema: {show_prefix : [episode_prefix]}) and episode_transcript_dict (Schema: {episode_prefix: transcript_text})
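
The repository's actual parsing lives in extract_data.py; as an illustrative sketch only, the transcript-joining step might look like the following, assuming each episode JSON follows the Google Speech-to-Text layout shipped with the Spotify Podcast Dataset (results → alternatives → transcript). The function name and schema handling here are our assumptions, not the repository's exact code.

```python
import json


def parse_transcript_json(raw: str) -> str:
    """Join the ASR chunks of one episode's transcript JSON into plain text.

    Assumes the Google Speech-to-Text layout used by the Spotify Podcast
    Dataset: a top-level "results" list, each entry holding "alternatives",
    where the first alternative carries the "transcript" text.
    """
    doc = json.loads(raw)
    parts = []
    for result in doc.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives and "transcript" in alternatives[0]:
            parts.append(alternatives[0]["transcript"].strip())
    return " ".join(parts)
```

Running this over every episode file under a `<data_id>` directory, keyed by episode prefix, would yield the episode_transcript_dict described above.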

generate_summ_dataset

Source: extract_data.py

Inputs: None

Outputs: None. Reads the dictionaries generated by extract_episode_lists_and_transcripts and writes a merged CSV file with the columns (ep_prefix, ep_transcript, ep_description).
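
A minimal sketch of the merge step, assuming the dict schemas described above plus a hypothetical episode_descriptions dict keyed the same way (the real script reads these from disk; here the CSV is returned as a string for brevity):

```python
import csv
import io


def generate_summ_dataset(show_episodes, episode_transcripts, episode_descriptions):
    """Merge the extraction dicts into one CSV with the columns
    (ep_prefix, ep_transcript, ep_description).

    show_episodes:        {show_prefix: [episode_prefix]}
    episode_transcripts:  {episode_prefix: transcript_text}
    episode_descriptions: {episode_prefix: creator_description}  # assumed
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["ep_prefix", "ep_transcript", "ep_description"])
    for episode_prefixes in show_episodes.values():
        for ep in episode_prefixes:
            # Keep only episodes that have both a transcript and a description
            if ep in episode_transcripts and ep in episode_descriptions:
                writer.writerow([ep, episode_transcripts[ep], episode_descriptions[ep]])
    return buf.getvalue()
```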

text_rank_selection

Source: content_selection.py

Inputs: transcript (str), n (int)

Outputs: Uses the summa library to extract the top n most important sentences from the episode transcript. n determines the fraction of sentences fed into the ratio param of summa.summarizer.summarize().
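
summa's summarizer takes a ratio (fraction of sentences to keep) rather than a sentence count, so n must be converted first. A hedged sketch of that conversion, with a naive sentence count standing in for whatever splitting the actual script uses:

```python
def textrank_ratio(transcript: str, n: int) -> float:
    """Fraction of sentences to keep so that roughly n survive.

    Sentence counting here is a naive punctuation count; the real
    script may split sentences differently.
    """
    num_sentences = max(1, transcript.count(".") + transcript.count("?") + transcript.count("!"))
    return min(1.0, n / num_sentences)


def text_rank_selection(transcript: str, n: int = 5) -> str:
    # summa's extractive TextRank summarizer; ratio is the fraction kept
    from summa.summarizer import summarize
    return summarize(transcript, ratio=textrank_ratio(transcript, n))
```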

get_word_len

Source: t5_training.py

Inputs: row (Series)

Outputs: Returns the word count of the given podcast transcript.

cap_word_len

Source: t5_training.py

Inputs: row (Series)

Outputs: Returns the row with its transcript truncated to 7000 words if it is longer; otherwise returns the row unchanged.
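
The pair of helpers can be sketched as follows; for brevity this operates on the transcript string directly rather than on the pandas Series row the actual script uses:

```python
MAX_WORDS = 7000  # cap used before feeding transcripts to T5


def get_word_len(transcript: str) -> int:
    """Word count of a transcript, using whitespace splitting."""
    return len(transcript.split())


def cap_word_len(transcript: str) -> str:
    """Truncate the transcript to MAX_WORDS words if it is longer."""
    words = transcript.split()
    if len(words) > MAX_WORDS:
        return " ".join(words[:MAX_WORDS])
    return transcript
```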

train

Source: t5_training.py

Inputs: epoch (int), model (Model), tokenizer (Model Tokenizer), device (CPU/GPU), loader (CustomDataloader), optimizer (torch Optimizer)

Outputs: Runs one epoch of training for the given transformer model, iterating over batches from the custom dataloader according to its batch-size parameters. Parameter optimization happens via the passed optimizer.

validate

Source: t5_training.py

Inputs: epoch (int), model (Model), tokenizer (Model Tokenizer), device (CPU/GPU), loader (CustomDataloader)

Outputs: Evaluates the model for the given epoch by sampling data, based on the batch-size parameters, from the passed custom dataloader. Returns pairs of predicted and actual summaries.

get_textrank_scores

Source: walkthrough.ipynb

Inputs: val_subset (DataFrame)

Outputs: Evaluates the TextRank method on the given validation subset and returns the average ROUGE F1 score across all samples of the subset.
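
The notebook presumably computes ROUGE with an off-the-shelf package; as a self-contained illustration of the metric being averaged, here is a minimal ROUGE-1 F1 over unigram overlap (the real scorer may also use ROUGE-2/L and different tokenization):

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped overlap
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def average_rouge_f1(pairs):
    """Mean ROUGE-1 F1 over (prediction, reference) pairs."""
    scores = [rouge1_f1(pred, ref) for pred, ref in pairs]
    return sum(scores) / len(scores)
```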

get_t5_scores

Source: walkthrough.ipynb

Inputs: val_subset (DataFrame), model (Model), tokenizer (Model Tokenizer)

Outputs: Evaluates the off-the-shelf T5 model on the given validation subset and returns the average ROUGE F1 score across all samples of the subset, using the passed model and corresponding tokenizer.

get_t5_finetuned_scores

Source: walkthrough.ipynb

Inputs: val_subset (DataFrame), model_final (Model), tokenizer_final (Model Tokenizer)

Outputs: Evaluates the T5 model fine-tuned on the Spotify dataset on the given validation subset and returns the average ROUGE F1 score across all samples of the subset, using the passed model and corresponding tokenizer.

Usage

Install dependencies

pip install -r requirements.txt

Run data extraction

Load extract_data.ipynb into jupyter or Google Colab. Execute all cells.

Perform content selection using TextRank

Load content_selection.ipynb into jupyter or Google Colab. Execute all cells.

Finetune T5 model on Spotify Dataset

Load t5_training.ipynb into Jupyter or Google Colab. Execute all cells. Note: We followed the tutorial on training the T5 transformer model provided by https://github.com/abhimishra91/transformers-tutorials.

Tutorial Notebook for evaluation and inference on the three methods

Load walkthrough.ipynb into jupyter or Google Colab. Execute all cells.


Results

Quantitative Results

The following table outlines our results on a subset of the validation dataset:

| Method Name | Average ROUGE F1 Score |
| --- | --- |
| TextRank | 0.109 |
| T5 (Off the shelf) | 0.144 |
| T5 (Finetuned on Spotify Dataset) | 0.312 |

Comparing ROUGE scores, the fine-tuned model clearly outperforms the other baselines, which confirms our expectation that domain adaptation on the Spotify dataset is a necessary step towards a higher score.

Human Evaluation

We asked five English-speaking volunteers to score the summaries on the Bad (B) to Excellent (E) scale defined by the original paper, so as to capture the subjectivity of how good or bad a summary is to a human evaluator.

| Grade | TextRank | T5-Pretrained | T5-FineTuned |
| --- | --- | --- | --- |
| Bad (B) | 0 | 3 | 0 |
| Fair (F) | 3 | 1 | 3 |
| Good (G) | 2 | 1 | 2 |
| Excellent (E) | 0 | 0 | 0 |
  • 3 out of 5 evaluators felt that the summaries generated by TextRank and fine-tuned T5 were comparable, and rated both Fair (F).
  • 1 evaluator felt that TextRank was definitely better, and 1 felt that, given enough data, fine-tuned T5 produces a much better abstraction of the podcast transcript.
  • 5 out of 5 evaluators agreed that fine-tuned T5 generated better summaries than off-the-shelf pretrained T5. This validates our assumption about the need for domain adaptation.

Room for improvement and Error Analysis

  • Dataset - The episode description may not be the ideal ground-truth summary, since it often contains promotional material that the model learns to append to every generated summary, creating post-processing overhead.
  • Compute - Even on the best Colab settings, the T5 model can only accept a limited number of tokens, so with more compute T5 could potentially generate even better summaries. Conversely, deep-learning-based techniques remain expensive for simple use cases.

Statement of personal contribution

Dattatreya Mohapatra (Captain)

Dattatreya was responsible for acquiring the dataset and preprocessing it from its raw form into the tabular format fed into the proposed models. This involved aggregating multiple JSON fragments into a single source of truth. Dattatreya also initiated the TextRank runs (the initial method) and conducted sample evaluations to test the code infrastructure end to end.

Arijit Ghosh Chowdhury

Arijit was responsible for picking up the TextRank work and starting the implementation for training the T5 transformer. This involved a full-scale analysis of TextRank and experimenting with different off-the-shelf T5 transformers from HuggingFace to find the one most apt for the current context, i.e. the Spotify dataset.

Gargi Balasubramaniam

Gargi was responsible for continuing the T5 approach and fine-tuning the T5 transformer on the Spotify dataset, in contrast to using an off-the-shelf version. This also involved tuning the model's hyperparameters (learning rate, epochs, etc.). Gargi also prepared the scripts for evaluating all three methods on a validation dataset.

Overall, while tasks were picked up in sequence, all team members contributed equally throughout the ideation process and assisted fellow teammates when required.

