Link to project tutorial and overview video.
This repository contains the code and reports used in the CS410 Fa21 course project. This document describes the usage of all scripts and the functions defined in them. For more information about the project tasks, please see the proposal and progress report.
File Name | Task Summary |
---|---|
extract_data.py | Extracts episode transcripts and creator descriptions from the raw dataset and processes them into combined CSVs keyed by episode prefix. |
content_selection.py | Reduces episode transcripts to the top 5 most important sentences according to the TextRank algorithm. |
t5_training.py | Fine-tunes the T5 transformer model on the preprocessed podcast dataset. |
walkthrough.ipynb | Tutorial notebook illustrating the code and evaluating the three methods. |
Each of the above Python files is accompanied by a corresponding `.ipynb` notebook as well. CAs/TAs are requested to use those notebooks for evaluation; they can be imported into a local Jupyter instance or Google Colab.
**Source:** `extract_data.py`
**Inputs:** `data_id` (`int`). Each of the raw data gzip files decompresses into `spotify-podcasts-2020/podcasts-transcripts/<data_id>`.
**Outputs:** A tuple of two dicts: `show_episodes_dict` (schema: `{show_prefix: [episode_prefix]}`) and `episode_transcript_dict` (schema: `{episode_prefix: transcript_text}`).
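The per-episode transcript files in the Spotify dataset follow the Google Speech-to-Text JSON output format. As a minimal sketch (the helper name `parse_transcript_json` and the exact field handling are assumptions, not taken from `extract_data.py`), pulling the full transcript text out of one such file could look like:

```python
import json

def parse_transcript_json(raw: str) -> str:
    """Concatenate the best-alternative transcript text from one episode's
    JSON file (Google Speech-to-Text output format, assumed schema)."""
    data = json.loads(raw)
    pieces = []
    for result in data.get("results", []):
        alternatives = result.get("alternatives", [])
        # Keep only the top-ranked alternative per segment, if present.
        if alternatives and "transcript" in alternatives[0]:
            pieces.append(alternatives[0]["transcript"].strip())
    return " ".join(pieces)
```

The concatenated string would then be stored in `episode_transcript_dict` under the episode's prefix.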
**Source:** `extract_data.py`
**Inputs:** None
**Outputs:** Reads the dictionaries generated by `extract_episode_lists_and_transcripts` and writes a merged CSV file with the columns `(ep_prefix, ep_transcript, ep_description)`.
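As an illustration of the merge step (function name and dict arguments are hypothetical; the script's actual signature may differ), joining the transcript and description dicts on the episode prefix and writing the CSV could be sketched with pandas:

```python
import pandas as pd

def write_merged_csv(transcripts: dict, descriptions: dict, path: str) -> pd.DataFrame:
    """Join transcript and description dicts on the episode prefix and
    write the result as a CSV with the expected column layout."""
    df = pd.DataFrame(
        [
            (prefix, text, descriptions.get(prefix, ""))
            for prefix, text in transcripts.items()
        ],
        columns=["ep_prefix", "ep_transcript", "ep_description"],
    )
    df.to_csv(path, index=False)
    return df
```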
**Source:** `content_selection.py`
**Inputs:** `transcript` (`str`), `n` (`int`)
**Outputs:** Uses the `summa` library to extract the top sentences (5 in our setup) from the episode transcript text. `n` determines the percentage of sentences passed as the `ratio` parameter to `summa.summarizer(ratio=...)`.
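Since `summa`'s `ratio` parameter expects a fraction of sentences rather than a count, a small conversion step is needed. A sketch under stated assumptions (the helper names and the naive period-based sentence count are ours, not the script's):

```python
def top_n_ratio(transcript: str, n: int = 5) -> float:
    """Fraction of sentences to keep so that summa returns roughly n
    sentences. Sentence count here is a naive period-based estimate."""
    num_sentences = max(1, transcript.count(". ") + 1)
    return min(1.0, n / num_sentences)

def select_top_sentences(transcript: str, n: int = 5) -> str:
    """TextRank-based extractive selection via the summa library."""
    # Imported lazily so the ratio helper also works without summa installed.
    from summa.summarizer import summarize
    return summarize(transcript, ratio=top_n_ratio(transcript, n))
```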
**Source:** `t5_training.py`
**Inputs:** `row` (`Series`)
**Outputs:** Returns the length of the given podcast transcript.
**Source:** `t5_training.py`
**Inputs:** `row` (`Series`)
**Outputs:** Caps the transcript at 7,000 words if it is longer; otherwise returns the row unchanged.
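The capping step can be sketched on a plain string (the function name is ours, and in the script it operates on a DataFrame row rather than a raw string):

```python
MAX_WORDS = 7000  # cap taken from the description above

def cap_transcript(text: str, max_words: int = MAX_WORDS) -> str:
    """Truncate a transcript to at most max_words whitespace-separated
    tokens; shorter transcripts pass through unchanged."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])
```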
**Source:** `t5_training.py`
**Inputs:** `epoch` (`int`), `model` (`Model`), `tokenizer` (`Model Tokenizer`), `device` (CPU/GPU), `loader` (`CustomDataloader`), `optimizer` (torch `Optimizer`)
**Outputs:** Runs one epoch of training for the given transformer model, iterating over batches drawn from the custom dataloader according to its batch-size parameters. Parameters are optimized via the passed optimizer.
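A typical shape for such a training epoch is sketched below. This is not the script's actual code: the function name, batch keys (`source_ids`, `source_mask`, `target_ids`), and logging interval are assumptions; it only assumes the Hugging Face seq2seq convention that calling the model with `labels` returns an object carrying a `.loss`:

```python
import torch

def train_one_epoch(epoch, model, device, loader, optimizer):
    """One pass over the dataloader: forward, loss, backward, step."""
    model.train()
    for step, batch in enumerate(loader):
        input_ids = batch["source_ids"].to(device)
        attention_mask = batch["source_mask"].to(device)
        labels = batch["target_ids"].to(device)
        # Seq2seq models compute the loss internally when labels are given.
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels)
        loss = outputs.loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 100 == 0:
            print(f"epoch {epoch} step {step} loss {loss.item():.4f}")
    return loss.item()
```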
**Source:** `t5_training.py`
**Inputs:** `epoch` (`int`), `model` (`Model`), `tokenizer` (`Model Tokenizer`), `device` (CPU/GPU), `loader` (`CustomDataloader`)
**Outputs:** Evaluates the model for the given epoch by sampling data, based on the batch-size parameters, from the passed custom dataloader. Returns pairs of predictions and ground-truth summaries.
**Source:** `walkthrough.ipynb`
**Inputs:** `val_subset` (`DataFrame`)
**Outputs:** Evaluates the TextRank model on the given validation subset and returns the average ROUGE F1 score across all samples of the subset.
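The notebook presumably computes ROUGE via a library; purely for intuition, here is a minimal self-contained ROUGE-1 F1 (unigram overlap) averaged over prediction/reference pairs. The function names are ours:

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def average_rouge1_f1(pairs) -> float:
    """Mean ROUGE-1 F1 over an iterable of (prediction, reference) pairs."""
    scores = [rouge1_f1(p, r) for p, r in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```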
**Source:** `walkthrough.ipynb`
**Inputs:** `val_subset` (`DataFrame`), `model` (`Model`), `tokenizer` (`Model Tokenizer`)
**Outputs:** Evaluates the off-the-shelf T5 model on the given validation subset and returns the average ROUGE F1 score across all samples, using the passed model and corresponding tokenizer.
**Source:** `walkthrough.ipynb`
**Inputs:** `val_subset` (`DataFrame`), `model_final` (`Model`), `tokenizer_final` (`Model Tokenizer`)
**Outputs:** Evaluates the T5 model fine-tuned on the Spotify dataset on the given validation subset and returns the average ROUGE F1 score across all samples, using the passed model and corresponding tokenizer.
1. Install dependencies: `pip install -r requirements.txt`
2. Load `extract_data.ipynb` into Jupyter or Google Colab and execute all cells.
3. Load `content_selection.ipynb` into Jupyter or Google Colab and execute all cells.
4. Load `t5_training.ipynb` into Jupyter or Google Colab and execute all cells. Note: we have followed the tutorial on training the T5 transformer model provided by https://github.com/abhimishra91/transformers-tutorials.
5. Load `walkthrough.ipynb` into Jupyter or Google Colab and execute all cells.
The following table outlines our results on a subset of the validation dataset:
Method Name | Average ROUGE F1 Score |
---|---|
TextRank | 0.109 |
T5 (Off the shelf) | 0.144 |
T5 (Finetuned on Spotify Dataset) | 0.312 |
Comparing ROUGE scores, the fine-tuned model clearly outperforms the other baselines, confirming our expectation that domain adaptation on the Spotify dataset is a necessary step towards a higher score.
We asked five English-speaking volunteers to score the summaries on the spectrum from Bad (B) to Excellent (E) defined by the original paper, so as to capture the subjectivity of how good or bad a summary is, based on its relevance to a human evaluator.
Grade | TextRank | T5-Pretrained | T5-FineTuned |
---|---|---|---|
Excellent (E) | 0 | 0 | 0 |
Bad (B) | 0 | 3 | 0 |
Fair (F) | 3 | 1 | 3 |
Good (G) | 2 | 1 | 2 |
- 3 out of 5 evaluators felt that the summaries generated by TextRank and fine-tuned T5 were comparable, and rated them Fair (F).
- 1 evaluator felt that TextRank is definitely better, and 1 evaluator felt that, given enough data, fine-tuned T5 is a much better abstraction of the podcast transcript.
- 5 out of 5 evaluators agreed that fine-tuned T5 generated better summaries than the off-the-shelf pretrained T5. This validates our assumption about the need for domain adaptation.
- Dataset - The episode description may not be the ideal ground-truth summary, since it often contains promotional material which the model learns to append to every summary, leading to post-processing overhead.
- Compute - Even on the best Colab settings, the T5 model can only ingest a limited number of tokens; given enough compute, T5 has the potential to generate even better summaries. Nevertheless, deep-learning-based techniques seem infeasible for simple use cases.
Dattatreya was responsible for acquiring the dataset and preprocessing it from its raw form into the tabular format used to feed the proposed models. This process involved aggregating multiple JSON fragments into a single source of truth. Dattatreya also initiated the TextRank runs (the initial method) and conducted sample evaluations to test the code infrastructure end to end.
Arijit was responsible for picking up the TextRank work and starting the implementation of T5 transformer training. This involved conducting a full-scale analysis using TextRank and experimenting with different off-the-shelf T5 transformers from HuggingFace to find the one most apt for the current context, i.e. the Spotify dataset.
Gargi was responsible for continuing the T5 approach and fine-tuning the T5 transformer on the Spotify dataset, in contrast to using an off-the-shelf version. This also involved tuning the model's hyperparameters (learning rate, epochs, etc.). Gargi also prepared the scripts for evaluating all three methods on a validation dataset.
Overall, while the tasks were picked up in sequence, all team members contributed equally throughout the ideation process and assisted fellow teammates when required.