Spooky-Classify

Claire Pritchard
January 2018

Text classification with scikit-learn and spaCy. The model predicts the author of a text excerpt and was used to generate predictions for the Kaggle Spooky Author Identification competition in December 2017. The dataset consists of text written by Edgar Allan Poe, H. P. Lovecraft, and Mary Shelley.

The data files can be downloaded from Kaggle. Training data is in train.csv, and the test data set for generating predictions is in test.csv.

Plotting the distribution of author labels in the training dataset with matplotlib shows noticeably more samples from Poe than from Lovecraft or Shelley. Rather than trying to find more Lovecraft and Shelley samples, I chose to resample using imbalanced-learn.

[Figure: Author distribution]

The model I finally arrived at is a VotingClassifier whose estimators are the three classifiers with predict_proba support that had the highest accuracy: MultinomialNB, BernoulliNB, and MLPClassifier. The VotingClassifier performed slightly better than any of the individual models.
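A minimal sketch of such an ensemble follows. Several details are assumptions rather than the project's actual settings: soft voting (suggested by the predict_proba requirement but not stated), `CountVectorizer` as the text vectorizer, the MLP hyperparameters, and the toy sentences used for fitting.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical sentences, not from the Kaggle dataset).
texts = [
    "the raven tapped at my chamber door",
    "a dreary midnight in the old chamber",
    "the ancient cult whispered beneath the sea",
    "nameless horrors slept beneath the ancient city",
    "the creature wandered alone across the ice",
    "my creator abandoned me upon the ice",
]
labels = ["EAP", "EAP", "HPL", "HPL", "MWS", "MWS"]

ensemble = VotingClassifier(
    estimators=[
        ("mnb", MultinomialNB()),
        ("bnb", BernoulliNB()),
        # Small network and fixed seed chosen only to keep the example fast.
        ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)),
    ],
    voting="soft",  # assumption: average the predicted class probabilities
)

# Assumption: bag-of-words counts feed all three estimators.
model = make_pipeline(CountVectorizer(), ensemble)
model.fit(texts, labels)

probabilities = model.predict_proba(["a midnight visitor at the chamber door"])
print(dict(zip(model.classes_, probabilities[0])))
```

With soft voting, each row of `probabilities` is the average of the three estimators' class probabilities, which is the per-author probability format the Kaggle competition asked for.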

Accuracy also improved slightly after adding a few new features: sentence length and the standard deviation of the word lengths in each sentence. The sentences were tokenized using spaCy.
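The two features could be computed roughly as follows. This is a sketch under stated assumptions: `length_features` is a hypothetical helper name, sentence length is counted in word tokens with punctuation excluded, the population standard deviation is used, and a blank English pipeline (`spacy.blank("en")`) stands in for whichever spaCy model the project loaded.

```python
import statistics

import spacy

# A blank pipeline provides spaCy's tokenizer without downloading a model.
nlp = spacy.blank("en")


def length_features(sentence):
    """Return (word count, std. dev. of word lengths) for one sentence."""
    # Assumption: punctuation and whitespace tokens are not counted as words.
    words = [t.text for t in nlp(sentence) if not t.is_punct and not t.is_space]
    word_lengths = [len(w) for w in words]
    spread = statistics.pstdev(word_lengths) if word_lengths else 0.0
    return len(words), spread


n_words, spread = length_features("The raven sat.")
print(n_words, round(spread, 4))
```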

After fitting the model, I got a score of 0.9988 on the training data and 0.8652 on the data held out for testing. Predictions for the held-out test data produced the following classification report and confusion matrix:

             precision    recall  f1-score   support
        EAP       0.83      0.92      0.87      1999
        HPL       0.92      0.80      0.86      1388
        MWS       0.86      0.86      0.86      1508
avg / total       0.87      0.87      0.86      4895

[Figure: Confusion matrix]
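A report and matrix of this kind can be produced with scikit-learn's metrics module. The sketch below uses small made-up label lists (`y_true` and `y_pred` are placeholders, not the project's predictions) to show the calls involved.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted author labels for eight held-out samples.
y_true = ["EAP", "EAP", "EAP", "HPL", "HPL", "MWS", "MWS", "MWS"]
y_pred = ["EAP", "EAP", "MWS", "HPL", "EAP", "MWS", "MWS", "HPL"]
authors = ["EAP", "HPL", "MWS"]

accuracy = accuracy_score(y_true, y_pred)

# Rows are true authors, columns are predicted authors, in the order of `authors`.
matrix = confusion_matrix(y_true, y_pred, labels=authors)

print(f"accuracy: {accuracy:.4f}")
print(classification_report(y_true, y_pred, labels=authors))
print(matrix)
```

Reading the matrix row by row shows which authors are confused with each other, which the per-class precision and recall figures only summarize.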
