kya-allen / text-classification---fanfic-pov


Uses statistical/machine learning techniques on unstructured text from Archive of Our Own (Ao3) to build a point-of-view classifier. Best accuracy so far: 93% (K Nearest Neighbors).



Text Classification - Fanfic POV

Classifying fanfiction from Archive of Our Own (Ao3) as first, second, or third person using statistical/machine learning algorithms.

Current Content:

Notebook on Data Extraction
- Used Archivist.py to scrape work IDs and first-chapter texts from Ao3.
Notebook on Cleaning & Feature Engineering
- Removed html artifacts from text.
- Tokenized by words and by sentences.
- Created features based on frequency ratios for key words that (1) appear anywhere in the text or (2) begin a sentence.
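
The cleaning and feature steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the keyword sets in `POV_KEYWORDS`, the regex-based tokenizers, and every function name here are assumptions, since this README does not list the actual key words or the tokenizer used.

```python
import re
from html import unescape

# Hypothetical keyword sets; the notebook's actual key words are not listed here.
POV_KEYWORDS = {
    "first": {"i", "me", "my", "we", "us", "our"},
    "second": {"you", "your", "yours"},
    "third": {"he", "she", "they", "him", "her", "them", "his", "their"},
}


def strip_html(raw):
    """Replace tags with spaces and decode entities such as &amp;."""
    return unescape(re.sub(r"<[^>]+>", " ", raw))


def split_sentences(text):
    """Naive sentence split on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(text):
    """Lowercased alphabetic tokens; contractions split, so "I'm" yields "i" and "m"."""
    return re.findall(r"[a-z]+", text.lower())


def pov_features(raw):
    """Return keyword frequency ratios over all words and over sentence-initial words."""
    text = strip_html(raw)
    all_words = split_words(text)
    # First word of each sentence that contains at least one word token.
    starts = [w[0] for w in (split_words(s) for s in split_sentences(text)) if w]
    feats = {}
    for pov, keys in POV_KEYWORDS.items():
        feats[f"{pov}_anywhere"] = sum(t in keys for t in all_words) / max(len(all_words), 1)
        feats[f"{pov}_sentence_start"] = sum(t in keys for t in starts) / max(len(starts), 1)
    return feats
```

Calling `pov_features("<p>I walked home. You saw me.</p>")` returns six ratios, two per point of view. Dividing by the number of words (or sentences) keeps the features comparable across chapters of different lengths.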
Notebook on training ML models
- Logistic Regression: 86% Accuracy
- K Nearest Neighbors: 93% Accuracy
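
A comparison like the one above could be set up along these lines with scikit-learn. This is a sketch under stated assumptions: the synthetic data from `make_toy_data` stands in for the real feature table, and the 75/25 split, feature scaling, and `n_neighbors=5` are illustrative choices, not the notebook's settings. The scores it produces say nothing about the 86% and 93% figures reported here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_toy_data(n_per_class=100, seed=0):
    """Synthetic stand-in for keyword-ratio features (hypothetical, not Ao3 data)."""
    rng = np.random.default_rng(seed)
    # Each class gets one dominant ratio column plus a small baseline.
    centers = np.eye(3) * 0.08 + 0.01
    X = np.vstack([rng.normal(c, 0.02, size=(n_per_class, 3)) for c in centers])
    y = np.repeat(["first", "second", "third"], n_per_class)
    return X, y


def compare_models(X, y, seed=0):
    """Fit both classifiers on the same split and return held-out accuracy per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    models = {
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "k_nearest_neighbors": make_pipeline(
            StandardScaler(), KNeighborsClassifier(n_neighbors=5)
        ),
    }
    return {
        name: model.fit(X_train, y_train).score(X_test, y_test)
        for name, model in models.items()
    }


X, y = make_toy_data()
scores = compare_models(X, y)
print(scores)
```

Scoring both models on the same stratified hold-out split keeps the accuracy comparison fair. Scaling matters mostly for K Nearest Neighbors, whose distance metric is sensitive to feature ranges.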

Planned Directions:

Implement more models:
- Random Forest
- Boosting
Productionize the model as a browser extension for Ao3 users
Investigate features that might allow the model to differentiate between third-person subtypes (e.g., limited, omniscient)
