harry-potter-data

1) About

This repository contains data files that can be used to perform a text analysis of the seven Harry Potter books by J. K. Rowling:

  1. Harry Potter and the Philosopher’s Stone

  2. Harry Potter and the Chamber of Secrets

  3. Harry Potter and the Prisoner of Azkaban

  4. Harry Potter and the Goblet of Fire

  5. Harry Potter and the Order of the Phoenix

  6. Harry Potter and the Half-Blood Prince

  7. Harry Potter and the Deathly Hallows

To perform the text analysis, we recommend using tidyverse tools (see packages below) and getting inspiration from the book Text Mining with R: A Tidy Approach (by Silge & Robinson):

library(tidyverse)
library(tidytext)

2) Content

The content of this repo is divided into three directories, each containing a different type of file.

  • csv-data-file/ contains the text of all Harry Potter books in a single CSV file.

  • rda-data-files/ contains the seven Harry Potter books stored in R-Data (binary) files—one file per book.

  • sentiment-lexicons/ contains a handful of sentiment lexicons, also stored in R-Data (binary) files—one file per lexicon.


2.1) Harry Potter CSV file

The text of all the books is available in CSV format, in a single file: harry_potter_books.csv.

Assuming that this file is in your working directory, you can import it with read_csv(), from tidyverse's readr package, as follows:

# requires the readr package (attached by library(tidyverse))
# col_types = "ccc" reads all three columns as character
hp_books <- read_csv("harry_potter_books.csv", col_types = "ccc")

The structure of this data set is fairly simple, although the text content is far from tidy. It has 95,085 rows and 3 columns:

  1. text: text content

  2. book: title of associated book

  3. chapter: associated chapter number
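
As a rough sketch of how these three columns can be used, the code below splits the text column into one word per row and counts the remaining words per book after dropping common stop words. To stay self-contained it builds a tiny stand-in tibble, hp_sample (a hypothetical name; its rows are made-up placeholders, not text from the books), with the same columns as hp_books. With the real data you would use hp_books in its place.

```r
library(tidyverse)
library(tidytext)

# Hypothetical stand-in with the same three character columns as hp_books;
# the rows are made-up placeholders, not text from the books.
hp_sample <- tibble(
  text    = c("The owl flew over the castle.",
              "The castle was quiet at night.",
              "An owl arrived with a letter."),
  book    = c("Book A", "Book A", "Book B"),
  chapter = c("1", "1", "1")
)

# One word per row; unnest_tokens() lowercases and strips punctuation by default
hp_words <- hp_sample %>%
  unnest_tokens(output = word, input = text)

# Drop common stop words (tidytext's stop_words data set),
# then count the remaining words within each book
word_counts <- hp_words %>%
  anti_join(stop_words, by = "word") %>%
  count(book, word, sort = TRUE)

word_counts
```

The book and chapter columns are carried along with every token, so the same pipeline can be grouped by chapter instead of by book.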


2.2) Harry Potter R-Data Files

The data of each book is also available in its own R-Data rda file (see rda-data-files/):

  • "philosophers_stone.rda"
  • "chamber_of_secrets.rda"
  • "prisoner_of_azkaban.rda"
  • "goblet_of_fire.rda"
  • "order_of_the_phoenix.rda"
  • "half_blood_prince.rda"
  • "deathly_hallows.rda"

These files come from the R package "harrypotter" by Bradley Boehmke:

https://github.com/bradleyboehmke/harrypotter

To import these files use the load() function. For example, consider the first book “Harry Potter and the Philosopher’s Stone”; here’s how to load() it in R:

# assuming that the rda file is in your working directory
load("philosophers_stone.rda")

Once "philosophers_stone.rda" has been loaded, the text of the book is available in a character vector of the same name, philosophers_stone:

# text is in a character vector
# (with as many elements as chapters in the book)
length(philosophers_stone)
#> [1] 17

The number of elements in philosophers_stone corresponds to the number of chapters in this book: 17 chapters.
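
Because load() restores objects under the names they were saved with, it can silently overwrite an existing object of the same name. One way to see what a file contains, sketched below, is to load it into a separate environment and inspect the names that load() returns. The example saves a hypothetical vector, demo_book, to a temporary file so that it runs without the repository files; the same pattern should apply to any of the rda files listed above.

```r
# Hypothetical stand-in for a book: one string per chapter (placeholder text)
demo_book <- c("Text of chapter one.", "Text of chapter two.")

# Save it to a temporary .rda file, then remove it from the workspace
rda_path <- tempfile(fileext = ".rda")
save(demo_book, file = rda_path)
rm(demo_book)

# Load into a fresh environment so nothing in the workspace is overwritten;
# load() returns (invisibly) the names of the objects it restored
book_env <- new.env()
loaded_names <- load(rda_path, envir = book_env)

loaded_names                 # names of the restored objects
length(book_env$demo_book)   # number of elements (chapters) in the vector
```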

You may want to use these files to perform bigram analysis (or other types of n-gram analysis).
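
A minimal sketch of a bigram count follows, using tidytext's unnest_tokens() with token = "ngrams". The vector demo_book is a hypothetical stand-in (made-up placeholder text, one string per chapter) for a loaded vector such as philosophers_stone.

```r
library(tidyverse)
library(tidytext)

# Hypothetical stand-in for a loaded book vector: one string per chapter
demo_book <- c("the boy who lived the boy who waited",
               "the boy went home")

# Put the chapters in a tibble, keeping the chapter number
chapters <- tibble(chapter = seq_along(demo_book), text = demo_book)

# Split each chapter into overlapping two-word sequences and count them
bigram_counts <- chapters %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

bigram_counts
```

Bigrams are formed within each element of the vector, so no bigram spans two chapters. Changing n to 3 gives trigrams in the same way.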


2.3) Sentiment Lexicons

In addition to the Harry Potter text, you can also find data for a handful of sentiment lexicons from the R package "textdata" (by Hvitfeldt and Silge):

  • "bing": Bing Liu’s General purpose English sentiment lexicon that categorizes words in a binary fashion, either positive or negative

  • "afinn": AFINN is a lexicon of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words have been manually labeled by Finn Årup Nielsen in 2009-2011.

  • "nrc": General purpose English sentiment/emotion lexicon. This lexicon labels words with ten possible sentiments or emotions: “negative”, “positive”, “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, or “trust”. The annotations were manually done through Amazon’s Mechanical Turk.

  • "loughran": English sentiment lexicon created for use with financial documents. This lexicon labels words with six possible sentiments important in financial contexts: “negative”, “positive”, “litigious”, “uncertainty”, “constraining”, or “superfluous”.

These lexicons come in rda data files (see sentiment-lexicons/):

  • bing.rda
  • afinn.rda
  • nrc.rda
  • loughran.rda

To import them in R, use the load() function. For example, here’s how to import the Bing lexicon:

# assuming that the rda files are in your working directory
load("bing.rda")

Once the file "bing.rda" has been loaded, the lexicon is available in a tibble of the same name, bing:

bing
#> # A tibble: 6,786 × 2
#>    word        sentiment
#>    <chr>       <chr>    
#>  1 2-faces     negative 
#>  2 abnormal    negative 
#>  3 abolish     negative 
#>  4 abominable  negative 
#>  5 abominably  negative 
#>  6 abominate   negative 
#>  7 abomination negative 
#>  8 abort       negative 
#>  9 aborted     negative 
#> 10 aborts      negative 
#> # ℹ 6,776 more rows
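
A lexicon with word and sentiment columns, like the one shown above, can be joined to tokenized text to count positive and negative words. The sketch below illustrates the idea with a hypothetical four-word lexicon, mini_lexicon, and made-up placeholder sentences, so that it runs without the repository files. With the real data you would join against bing instead.

```r
library(tidyverse)
library(tidytext)

# Hypothetical miniature lexicon with the same two columns as bing
mini_lexicon <- tibble(
  word      = c("happy", "brave", "afraid", "terrible"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Made-up placeholder sentences, not text from the books
sample_lines <- tibble(
  chapter = c(1, 1, 2),
  text    = c("He was happy and brave.",
              "She was afraid.",
              "It was a terrible, terrible day.")
)

# Tokenize, keep only the words found in the lexicon,
# and count them by chapter and sentiment
sentiment_counts <- sample_lines %>%
  unnest_tokens(word, text) %>%
  inner_join(mini_lexicon, by = "word") %>%
  count(chapter, sentiment)

sentiment_counts
```

The inner join drops every word that is not in the lexicon, so the counts reflect only the words the lexicon covers.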
