

TBBTCorpus

The Big Bang Theory Transcript Corpus

Downloading Corpus

We used fan-sourced transcripts of The Big Bang Theory as the data set for our experiment. The blog site had transcripts for 9 seasons (220 scenes) of The Big Bang Theory, categorized by season and episode. One of the first tasks was to extract the transcripts from these web pages.

Scripts

We manually constructed a list of all the URLs. Using this list, the content was scraped from the web, and only the relevant text was extracted and dumped as plain text under the corpus/raw_corpus directory.

You can run util.py to reproduce this corpus. By default, it uses the file episode_links.json for the list of links.

python util.py episode_links.json 

Requires the lxml and requests Python modules.

You can run preprocessing.py to preprocess the data. This script requires corpus.json.

python preprocessing.py

The previous step, util.py, creates corpus.json, which is a JSON representation of the transcripts categorized by season and episode.

Processing Corpus

A single episode transcript is of the form

[Scene]
SpeakerA : ***Some Text***
SpeakerB : ***Some Text***
SpeakerC : ***Some Text***
...
...
...
[Scene]
SpeakerA : ***Some Text***
SpeakerB : ***Some Text***
SpeakerC : ***Some Text***
...
...
...
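
As a rough illustration of this layout, the sketch below splits one episode transcript into scenes and turns. The function name `split_into_scenes`, the two regular expressions, and the sample text are assumptions made for this example; they are not taken from preprocessing.py.

```python
import re

# Assumed line shapes: "[...]" opens a scene, "Speaker : text" is one turn.
SCENE_RE = re.compile(r"^\[(?P<description>.*)\]\s*$")
TURN_RE = re.compile(r"^(?P<speaker>[^:]+):\s*(?P<text>.*)$")


def split_into_scenes(transcript):
    """Split one episode transcript into a list of scene dicts."""
    scenes = []
    current = None
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        scene_match = SCENE_RE.match(line)
        if scene_match:
            # A bracketed line starts a new scene.
            current = {
                "description": scene_match.group("description").strip(),
                "turns": [],
            }
            scenes.append(current)
            continue
        turn_match = TURN_RE.match(line)
        # Lines before the first scene, or without a colon, are skipped.
        if turn_match and current is not None:
            current["turns"].append(
                (turn_match.group("speaker").strip(),
                 turn_match.group("text").strip())
            )
    return scenes


# Hypothetical sample in the same shape as the layout above.
sample = """[Scene: The apartment]
Sheldon : Some text
Leonard : Some other text
[Scene: The hallway]
Penny : More text
"""
scenes = split_into_scenes(sample)
```

Each scene here is a plain dict with a description and a list of (speaker, text) pairs, which keeps the sketch independent of whatever classes the real scripts use.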

Challenges with processing

  • We treat every scene as a distinct unit to be processed. Theoretically, a scene change also occurs when characters enter or exit. However, in our corpus only 13 turns had enters and 18 had exits out of a total of 1028000 turns, so we decided not to split the scene when a character enters or exits.

    There were instances where they made sense, e.g.

    Penny (as Sheldon enters): Shh! Act normal.

    In other instances they had no relevance, e.g.

    Leonard (entering on the phone): I’m really very busy. Is there any way that we can put this off

  • Another challenge was instances where extra attributes were mentioned along with the speaker information. This extra information needed to be removed to capture the speaker name. We could have used Named Entity Recognition for this, but we used the naive approach of parsing for brackets and ignoring any content within them. This has worked fine for us.

    For example,

    Speaker: Penny( Sitting quitely):

    resulted in capturing the speaker as Penny.

  • As not all speakers had enough dialogues to identify the character, we needed to define the subset of characters to be used as classification labels. We wrote an additional script to count the number of dialogues by each character. We chose a threshold of 4 to decide the main classification labels, and included an extra label, "Others", for all other characters clubbed together. Hence each dialogue is classified as spoken by one of five labels in our case: ["Leonard","Sheldon","Penny", "Howard","Others"]

    Character Dialogues
    Sheldon 164415
    Leonard 94143
    Penny 73147
    Howard 62951
    Raj 50924
    Amy 32641
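
The bracket-stripping and label-selection steps described above could look roughly like the following. This is a minimal sketch: `clean_speaker`, `choose_labels`, and `to_label` are hypothetical names, and reading the threshold of 4 as "keep the four most frequent speakers" is an assumption, not something the scripts are confirmed to do.

```python
import re
from collections import Counter

# Matches one parenthesised note such as "( Sitting quitely)".
BRACKETED = re.compile(r"\([^)]*\)")


def clean_speaker(raw):
    """Drop parenthesised notes and stray colons, keeping only the name."""
    return BRACKETED.sub("", raw).replace(":", "").strip()


def choose_labels(speakers, top_n=4):
    """Return the top_n most frequent speakers (assumed reading of the threshold)."""
    counts = Counter(speakers)
    return [name for name, _ in counts.most_common(top_n)]


def to_label(speaker, labels):
    """Map any speaker outside the chosen labels to "Others"."""
    return speaker if speaker in labels else "Others"


# Hypothetical toy data, not counts from the corpus.
speakers = ["Sheldon"] * 3 + ["Leonard"] * 2 + ["Stuart"]
labels = choose_labels(speakers, top_n=2)
cleaned = clean_speaker("Penny( Sitting quitely):")
```

A regular expression is enough here because the notes are never nested; nested brackets would need a small parser instead.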

For each scene we capture

  • Scene Description
  • Season_Episode
  • List of Turns (Turn Object)
  • List of Participants ( Names of Characters present in the turn)

For each turn (Turn Object) we capture

  • Speaker
  • Recipients
  • List of words (after removing stop words) in the turn (utterance)
  • POS tag for each word
  • Topic associated with the turn (Topic extracted using LDA model)
  • ACT Tag
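
One way to picture the two records listed above is as a pair of dataclasses. The class and field names below are assumptions chosen to mirror the bullet points, and the "1_1" key is only a guess at the season_episode format; the actual objects in the repository may differ.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Turn:
    speaker: str
    recipients: List[str] = field(default_factory=list)
    words: List[str] = field(default_factory=list)     # stop words already removed
    pos_tags: List[str] = field(default_factory=list)  # one tag per word
    topic: str = ""                                    # e.g. assigned by an LDA model
    act_tag: str = ""


@dataclass
class Scene:
    description: str
    season_episode: str
    turns: List[Turn] = field(default_factory=list)
    participants: List[str] = field(default_factory=list)


# Hypothetical values for illustration only.
turn = Turn(
    speaker="Penny",
    recipients=["Leonard"],
    words=["act", "normal"],
    pos_tags=["VB", "JJ"],
)
scene = Scene(
    description="The apartment",
    season_episode="1_1",
    turns=[turn],
    participants=["Penny", "Leonard"],
)
scene_dict = asdict(scene)  # nested dicts and lists, ready for json.dumps
```

`asdict` converts nested dataclasses recursively, so a scene can be serialized without writing a custom encoder.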

We save the list of scenes per episode in JSON format, where each episode in a season has a key of season_episode. Additionally, the script generates a JSON file of the unique words over the entire corpus, with the lemmatized word as key and its occurrence frequency as value. This can be useful for topic extraction.
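
The two JSON outputs can be sketched as follows. The function name `build_outputs`, the dict layout, and the "1_1" key are assumptions, and `str.lower` is only a stand-in for a real lemmatizer so that the example runs without extra dependencies.

```python
import json
from collections import Counter


def build_outputs(scenes_by_episode, lemmatize=str.lower):
    """Return (episodes JSON string, word-frequency JSON string)."""
    frequencies = Counter()
    for scenes in scenes_by_episode.values():
        for scene in scenes:
            for turn in scene["turns"]:
                # Count each word under its (stand-in) lemma.
                frequencies.update(lemmatize(word) for word in turn["words"])
    episodes_json = json.dumps(scenes_by_episode, indent=2)
    frequency_json = json.dumps(frequencies, indent=2)
    return episodes_json, frequency_json


# Hypothetical input keyed by season_episode.
scenes_by_episode = {
    "1_1": [
        {
            "description": "The apartment",
            "turns": [
                {"speaker": "Sheldon", "words": ["Coffee", "physics"]},
                {"speaker": "Leonard", "words": ["coffee"]},
            ],
        }
    ]
}
episodes_json, frequency_json = build_outputs(scenes_by_episode)
word_counts = json.loads(frequency_json)
```

Passing the lemmatizer as a parameter keeps the counting logic separate from the choice of NLP toolkit.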

We use nltk toolkit to remove stop words and

Please see the corpus reader tool, the train and classify methods, and the evaluation methods for more information on the next steps.

Contributors

skashyap7, chethanv28, wenhaoz-fengcai, rakh7
