Link to project tutorial and overview video.
This repository contains the code and reports used in the CS410 Fa21 course project. This document describes the usage of all scripts and the functions defined in them. For more information about the project tasks, please see the proposal and progress report.
File Name | Task Summary |
---|---|
extract_data.py | Extracts episode transcripts and creator descriptions from the raw dataset and processes them into combined CSVs keyed by episode prefix. |
content_selection.py | Reduces episode transcripts to the top 5 most important sentences according to the TextRank algorithm. |
t5_training.py | Fine-tunes the T5 transformer model on the preprocessed podcast dataset. |
walkthrough.ipynb | Tutorial notebook illustrating the code and evaluating the three methods. |
Each of the above Python files is accompanied by a corresponding `.ipynb` notebook as well. CAs/TAs are requested to use those notebooks for evaluation; they can be imported into a local Jupyter instance or Google Colab.
**Source:** `extract_data.py`
**Inputs:** `data_id` (`int`). Each of the raw data gzip files decompresses into `spotify-podcasts-2020/podcasts-transcripts/<data_id>`.
**Outputs:** A tuple of two dicts: `show_episodes_dict` (schema: `{show_prefix: [episode_prefix]}`) and `episode_transcript_dict` (schema: `{episode_prefix: transcript_text}`).
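The per-episode transcript files in the Spotify dataset follow the Google Speech-to-Text JSON output format. As a minimal sketch (the helper name `parse_transcript_json` and the exact field handling are assumptions, not taken from `extract_data.py`), pulling the full transcript text out of one such file could look like:

```python
import json

def parse_transcript_json(raw: str) -> str:
    """Concatenate the best-alternative transcript text from one episode's
    JSON file (Google Speech-to-Text output format, assumed schema)."""
    data = json.loads(raw)
    pieces = []
    for result in data.get("results", []):
        alternatives = result.get("alternatives", [])
        # Keep only the top-ranked alternative per segment, if present.
        if alternatives and "transcript" in alternatives[0]:
            pieces.append(alternatives[0]["transcript"].strip())
    return " ".join(pieces)
```

The concatenated string would then be stored in `episode_transcript_dict` under the episode's prefix.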
**Source:** `extract_data.py`
**Inputs:** None
**Outputs:** Reads the dictionaries generated by `extract_episode_lists_and_transcripts` and writes a merged CSV file with the columns `(ep_prefix, ep_transcript, ep_description)`.
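As an illustration of the merge step (function name and dict arguments are hypothetical; the script's actual signature may differ), joining the transcript and description dicts on the episode prefix and writing the CSV could be sketched with pandas:

```python
import pandas as pd

def write_merged_csv(transcripts: dict, descriptions: dict, path: str) -> pd.DataFrame:
    """Join transcript and description dicts on the episode prefix and
    write the result as a CSV with the expected column layout."""
    df = pd.DataFrame(
        [
            (prefix, text, descriptions.get(prefix, ""))
            for prefix, text in transcripts.items()
        ],
        columns=["ep_prefix", "ep_transcript", "ep_description"],
    )
    df.to_csv(path, index=False)
    return df
```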
**Source:** `content_selection.py`
**Inputs:** `transcript` (`str`), `n` (`int`)
**Outputs:** Uses the `summa` library to extract the top sentences (5 in our setup) from the episode transcript text. `n` determines the percentage of sentences passed as the `ratio` parameter to `summa.summarizer(ratio=...)`.
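Since `summa`'s `ratio` parameter expects a fraction of sentences rather than a count, a small conversion step is needed. A sketch under stated assumptions (the helper names and the naive period-based sentence count are ours, not the script's):

```python
def top_n_ratio(transcript: str, n: int = 5) -> float:
    """Fraction of sentences to keep so that summa returns roughly n
    sentences. Sentence count here is a naive period-based estimate."""
    num_sentences = max(1, transcript.count(". ") + 1)
    return min(1.0, n / num_sentences)

def select_top_sentences(transcript: str, n: int = 5) -> str:
    """TextRank-based extractive selection via the summa library."""
    # Imported lazily so the ratio helper also works without summa installed.
    from summa.summarizer import summarize
    return summarize(transcript, ratio=top_n_ratio(transcript, n))
```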
**Source:** `t5_training.py`
**Inputs:** `row` (`Series`)
**Outputs:** Returns the length of the given podcast transcript.
**Source:** `t5_training.py`
**Inputs:** `row` (`Series`)
**Outputs:** Caps the transcript at 7,000 words if it is longer; otherwise returns the row unchanged.
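The capping step can be sketched on a plain string (the function name is ours, and in the script it operates on a DataFrame row rather than a raw string):

```python
MAX_WORDS = 7000  # cap taken from the description above

def cap_transcript(text: str, max_words: int = MAX_WORDS) -> str:
    """Truncate a transcript to at most max_words whitespace-separated
    tokens; shorter transcripts pass through unchanged."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])
```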
**Source:** `t5_training.py`
**Inputs:** `epoch` (`int`), `model` (`Model`), `tokenizer` (`Model Tokenizer`), `device` (CPU/GPU), `loader` (`CustomDataloader`), `optimizer` (torch `Optimizer`)
**Outputs:** Runs one epoch of training for the given transformer model, iterating over batches drawn from the custom dataloader according to its batch-size parameters. Parameters are optimized via the passed optimizer.
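A typical shape for such a training epoch is sketched below. This is not the script's actual code: the function name, batch keys (`source_ids`, `source_mask`, `target_ids`), and logging interval are assumptions; it only assumes the Hugging Face seq2seq convention that calling the model with `labels` returns an object carrying a `.loss`:

```python
import torch

def train_one_epoch(epoch, model, device, loader, optimizer):
    """One pass over the dataloader: forward, loss, backward, step."""
    model.train()
    for step, batch in enumerate(loader):
        input_ids = batch["source_ids"].to(device)
        attention_mask = batch["source_mask"].to(device)
        labels = batch["target_ids"].to(device)
        # Seq2seq models compute the loss internally when labels are given.
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels)
        loss = outputs.loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 100 == 0:
            print(f"epoch {epoch} step {step} loss {loss.item():.4f}")
    return loss.item()
```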
**Source:** `t5_training.py`
**Inputs:** `epoch` (`int`), `model` (`Model`), `tokenizer` (`Model Tokenizer`), `device` (CPU/GPU), `loader` (`CustomDataloader`)
**Outputs:** Evaluates the model for the given epoch by sampling data, based on the batch-size parameters, from the passed custom dataloader. Returns pairs of predictions and ground-truth summaries.
**Source:** `walkthrough.ipynb`
**Inputs:** `val_subset` (`DataFrame`)
**Outputs:** Evaluates the TextRank model on the given validation subset and returns the average ROUGE F1 score across all samples of the subset.
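The notebook presumably computes ROUGE via a library; purely for intuition, here is a minimal self-contained ROUGE-1 F1 (unigram overlap) averaged over prediction/reference pairs. The function names are ours:

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def average_rouge1_f1(pairs) -> float:
    """Mean ROUGE-1 F1 over an iterable of (prediction, reference) pairs."""
    scores = [rouge1_f1(p, r) for p, r in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```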
**Source:** `walkthrough.ipynb`
**Inputs:** `val_subset` (`DataFrame`), `model` (`Model`), `tokenizer` (`Model Tokenizer`)
**Outputs:** Evaluates the off-the-shelf T5 model on the given validation subset and returns the average ROUGE F1 score across all samples, using the passed model and corresponding tokenizer.
**Source:** `walkthrough.ipynb`
**Inputs:** `val_subset` (`DataFrame`), `model_final` (`Model`), `tokenizer_final` (`Model Tokenizer`)
**Outputs:** Evaluates the T5 model fine-tuned on the Spotify dataset on the given validation subset and returns the average ROUGE F1 score across all samples, using the passed model and corresponding tokenizer.
1. Install dependencies: `pip install -r requirements.txt`
2. Load `extract_data.ipynb` into Jupyter or Google Colab and execute all cells.
3. Load `content_selection.ipynb` into Jupyter or Google Colab and execute all cells.
4. Load `t5_training.ipynb` into Jupyter or Google Colab and execute all cells. Note: we have followed the tutorial on training the T5 transformer model provided by https://github.com/abhimishra91/transformers-tutorials.
5. Load `walkthrough.ipynb` into Jupyter or Google Colab and execute all cells.
The following table outlines our results on a subset of the validation dataset:
Method Name | Average ROUGE F1 Score |
---|---|
TextRank | 0.109 |
T5 (Off the shelf) | 0.144 |
T5 (Finetuned on Spotify Dataset) | 0.312 |
Comparing ROUGE scores, the fine-tuned model clearly outperforms the other baselines, confirming our expectation that domain adaptation on the Spotify dataset is a necessary step towards a higher score.
We asked five English-speaking volunteers to score the summaries on the spectrum from Bad (B) to Excellent (E) defined by the original paper, so as to capture the subjectivity of how good or bad a summary is, based on its relevance to a human evaluator.
Grade | TextRank | T5-Pretrained | T5-FineTuned |
---|---|---|---|
Excellent (E) | 0 | 0 | 0 |
Bad (B) | 0 | 3 | 0 |
Fair (F) | 3 | 1 | 3 |
Good (G) | 2 | 1 | 2 |
- 3 out of 5 evaluators felt that the summaries generated by TextRank and fine-tuned T5 were comparable, and rated them Fair (F).
- 1 evaluator felt that TextRank is definitely better, and 1 evaluator felt that, given enough data, fine-tuned T5 is a much better abstraction of the podcast transcript.
- 5 out of 5 evaluators agreed that fine-tuned T5 generated better summaries than the off-the-shelf pretrained T5. This validates our assumption about the need for domain adaptation.
- Dataset - The episode description may not be the ideal ground-truth summary, since it often contains promotional material which the model learns to append to every summary, leading to post-processing overhead.
- Compute - Even on the best Colab settings, the T5 model can only ingest a limited number of tokens; given enough compute, T5 has the potential to generate even better summaries. Nevertheless, deep-learning-based techniques seem infeasible for simple use cases.
Dattatreya was responsible for acquiring the dataset and preprocessing it from its raw form into the tabular format used to feed the proposed models. This process involved aggregating multiple JSON fragments into a single source of truth. Dattatreya also initiated the TextRank runs (the initial method) and conducted sample evaluations to test the code infrastructure end to end.
Arijit was responsible for picking up the TextRank work and starting the implementation of T5 transformer training. This involved conducting a full-scale analysis using TextRank and experimenting with different off-the-shelf T5 transformers from HuggingFace to find the one most apt for the current context, i.e. the Spotify dataset.
Gargi was responsible for continuing the T5 approach and fine-tuning the T5 transformer on the Spotify dataset, in contrast to using an off-the-shelf version. This also involved tuning the model's hyperparameters (learning rate, epochs, etc.). Gargi also prepared the scripts for evaluating all three methods on a validation dataset.
Overall, while the tasks were picked up in sequence, all team members contributed equally throughout the ideation process and assisted fellow teammates when required.