data-scripts-star-wars's Introduction

Scripts from Star Wars movies

Overview

This repository contains the complete scripts of the first 6 Star Wars movies, as well as a light version of each containing only the dialogue and the locations where it takes place.

You will also find images of every character who has an interaction in the films.

The data folder contains a CSV file for each film with information on the dialogue, such as the number of words per interaction, the types of words, the duration of the interaction, who speaks to whom, and the location.

If you like this work, do not hesitate to visit my site.

Why this repo

This started as a personal project, but given the amount of work accomplished, I find it important to share it with the community.

The idea for the Star Wars project came to me after a conference on data visualization at the KIKK festival in Namur, Belgium. The speaker (Nadieh Bremer) made me want to create a data visualization, and what could be better than the theme of Star Wars? My data visualization will be available later.

First, I retrieved the scripts of the first 6 films. I laid them out in Markdown files, keeping only the dialogue, the speakers and the locations. I accompanied the script files with one file per film listing all of its characters.

Next, I converted the Markdown files to HTML so that I could automatically extract the data I wanted with JavaScript. At the same time, I wrote a small script that counts the words.

The second part of my work consisted of watching all the films and checking that the scripts were correct.

This done, I entered each film into a numbers file with several fields, including the speaker, the interlocutor, the content, the duration, the location, the number of words and the types of words. Thanks to the subtitle files, I was able to recover the duration of each exchange and check all the data a second time.

Project progress

A brief overview of the project's progress.

  • Recover script files (3rd January 2020)
  • Transcription and cleaning in markdown (8th January 2020)
  • Adding data in the sheet (8th January 2020)
  • Adding listeners
  • Adding durations
  • Adding types of words
  • Creating CSV files for each movie
  • Creating JSON files for each movie

What you can find here

In this repo, you can find several files about the Star Wars universe.

Folder | Description
📂 Sources | All source files that were used to collect the data
📂 Markdown | Markdown files with dialogue, speakers and locations
📂 JSON for sheet | JSON files formatted to populate the Sheet file
📂 Data sheet | Sheet file which gathers all the information for each film
📂 Data CSV | CSV file ready to use for each film
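
To give an idea of how the CSV files could be consumed, here is a minimal JavaScript sketch that parses CSV text and totals the words spoken per character. The column names (`from`, `to`, `text`, `number`) are assumptions borrowed from the JSON keys used elsewhere in this README, so the real CSV headers may differ. The sample rows are invented for illustration, and `parseCsv` and `totalBy` are hypothetical helpers, not part of the project.

```javascript
// Minimal CSV parser: handles quoted fields, escaped quotes ("") and commas inside quotes.
function parseCsv(text) {
    const rows = [];
    let row = [];
    let field = '';
    let inQuotes = false;

    for (let i = 0; i < text.length; i++) {
        const c = text[i];
        if (inQuotes) {
            if (c === '"' && text[i + 1] === '"') { field += '"'; i++; }
            else if (c === '"') { inQuotes = false; }
            else { field += c; }
        } else if (c === '"') {
            inQuotes = true;
        } else if (c === ',') {
            row.push(field); field = '';
        } else if (c === '\n') {
            row.push(field); rows.push(row); row = []; field = '';
        } else if (c !== '\r') {
            field += c;
        }
    }
    // Flush the last row when the text does not end with a line break.
    if (field !== '' || row.length > 0) { row.push(field); rows.push(row); }
    return rows;
}

// Sum a numeric column for each distinct value of a key column.
function totalBy(rows, keyColumn, valueColumn) {
    const [header, ...records] = rows;
    const k = header.indexOf(keyColumn);
    const v = header.indexOf(valueColumn);
    const totals = {};
    for (const r of records) {
        totals[r[k]] = (totals[r[k]] || 0) + Number(r[v]);
    }
    return totals;
}

// Invented sample data; the column names are assumptions.
const sampleCsv = [
    'from,to,text,number',
    'LUKE,BEN,"Hello there, old friend.",4',
    'BEN,LUKE,"Hello.",1',
    'LUKE,BEN,"Where are we going?",4'
].join('\n');

const wordsPerSpeaker = totalBy(parseCsv(sampleCsv), 'from', 'number');
console.log(wordsPerSpeaker);
```

The parser handles quoted fields because dialogue text usually contains commas, which a plain `split(',')` would cut in the wrong place.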

Some code

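The word-count function used for the project:

function countWords(s) {
    // Trim leading and trailing whitespace.
    s = s.replace(/(^\s*)|(\s*$)/g, "");
    // Collapse runs of spaces into a single space.
    s = s.replace(/[ ]{2,}/g, " ");
    // Replace every period (including those in ellipses) with a space.
    s = s.replace(/[.]/g, " ");
    // Remove parenthesized asides; the match is greedy, from the first "(" to the last ")" on a line.
    s = s.replace(/[(]+.+[)]/g, " ");
    // Drop the space that follows the first line break.
    s = s.replace(/\n /, "\n");
    // Split on spaces and count the non-empty pieces.
    return s.split(' ').filter(function (str) { return str != ""; }).length;
}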
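
As a point of comparison, the sketch below is a hypothetical variant (`countSpokenWords` is not part of the project and was not used to build the dataset). It assumes that each parenthesized aside should be stripped on its own and that any whitespace separates words, so its counts can differ from the function above when a line contains several asides.

```javascript
// Hypothetical variant of the word counter; not the function used to build the dataset.
function countSpokenWords(sentence) {
    return sentence
        .replace(/\([^)]*\)/g, ' ')   // drop each "(...)" aside on its own
        .replace(/\./g, ' ')          // treat periods (and ellipses) as separators
        .split(/\s+/)                 // split on any run of whitespace
        .filter(word => word !== '')  // ignore empty pieces at the ends
        .length;
}

// Invented sample line with two asides.
console.log(countSpokenWords('(smiling) I know... (pause) I love you.')); // 5
```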

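Code to build the JSON

// Every <ul> lists dialogue lines; the <p> just before it, if any, names the location.
const uls = document.querySelectorAll('ul')
var global = []
var wordsGlobal = []
let where = null

// Time per word, rounded to the nearest integer (83672 / 9595 ≈ 8.72, so 9).
let timePerWords = Math.round(83672 / 9595);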

uls.forEach(ul => {
    let previousEl = ul.previousElementSibling
    let lis = [...ul.getElementsByTagName('li')]

    if (previousEl !== null) {
        if (previousEl.nodeName === "P") {
            where = previousEl.innerText
        }
    }

    lis.forEach(li => {
        let content = li.innerText.split(" : ")
        let number = 0
        let text = content[1]
        let textFormat = null
        let peoples = content[0].split(' to ')

        // formatSentence() is not defined in this snippet.
        if (typeof text !== 'undefined') {
            textFormat = formatSentence(text)
            number = countWords(textFormat)
        }

        global.push({
            "from": peoples[0],
            "to": peoples[1],
            "text": text,
            "where": where,
            "number": number,
            "time": number * timePerWords
        })
    })
})

let data = JSON.stringify(global)
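
The core of the loop above is splitting each list item of the form `SPEAKER to LISTENER : text` into fields. The sketch below does the same on plain strings so that it can run outside a browser. `parseDialogueLine` is a hypothetical helper, the sample line is invented, and the whitespace-based word count is a simplified stand-in, since `formatSentence` is not shown in this README.

```javascript
// Hypothetical helper: turn "SPEAKER to LISTENER : text" into a record.
function parseDialogueLine(line, where) {
    const separator = line.indexOf(' : ');
    // Lines without " : " carry no dialogue text.
    const head = separator === -1 ? line : line.slice(0, separator);
    const text = separator === -1 ? null : line.slice(separator + 3);
    // A missing listener is stored as null.
    const [from, to = null] = head.split(' to ');
    // Simplified word count (whitespace split), standing in for the project's own counter.
    const number = text === null ? 0 : text.split(/\s+/).filter(Boolean).length;
    return { from, to, text, where, number };
}

const record = parseDialogueLine('LUKE to BEN : Where are we going?', 'TATOOINE');
console.log(JSON.stringify(record));
```

Cutting at the first ` : ` with `indexOf` keeps the rest of the line intact even if the dialogue itself contains the same separator.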

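The PHP script used to get the total on-screen speech time from the SRT subtitle file:

  // Read the subtitle file.
  $data = file_get_contents('./srt.txt', false);
  // Remove single digits wrapped in asterisks, such as *1*.
  $res = preg_replace("/\*([0-9])\*/", "", $data);
  // Keep only digits and the ":", "," and ">" characters; everything else becomes a space.
  $res2 = preg_replace("/[^0-9:,>]/", " ", $res);
  // Collapse whitespace, drop isolated commas, and join "start > end" into "start>end".
  $res3 = preg_replace('!\s+!', ' ', $res2);
  $res4 = preg_replace('/ , /', ' ', $res3);
  $res5 = preg_replace('/ > /', '>', $res4);
  $res6 = preg_replace('!\s+!', ' ', $res5);
  // One array entry per space-separated token.
  $res7 = explode(" ", $res6);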

  $minutesGlobal = 0;
  $msGlobal = 0;

  foreach($res7 as $part) {
    $explode = explode('>', $part);

    if (sizeof($explode) == 2) {

      $explode[0] = preg_replace("/,/", ":", $explode[0]);
      $explodeNumbers = explode(':',$explode[0]);
      $explode[1] = preg_replace("/,/", ":", $explode[1]);
      $explodeNumbers2 = explode(':',$explode[1]);

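      // Component-wise differences between the end and start timestamps.
      $ms = (int)$explodeNumbers2[3] - (int)$explodeNumbers[3];
      $secondes = (int)$explodeNumbers2[2] - (int)$explodeNumbers[2];
      $minutes = (int)$explodeNumbers2[1] - (int)$explodeNumbers[1];
      $hours = (int)$explodeNumbers2[0] - (int)$explodeNumbers[0];

      // Only the minute and millisecond differences are accumulated; $secondes and $hours are not used further.
      $minutesGlobal += $minutes;
      $msGlobal += $ms;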
    }
  }

  echo $minutesGlobal . ' ' . $msGlobal;
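
For comparison, here is a JavaScript sketch that computes the total display time of all subtitle cues by converting each timestamp to milliseconds before subtracting. It is not a port of the PHP above: it assumes standard SRT timing lines of the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`, the function names are hypothetical, and the sample subtitles are invented.

```javascript
// Convert an SRT timestamp such as "00:01:02,500" to milliseconds.
function srtTimeToMs(timestamp) {
    const [hours, minutes, rest] = timestamp.split(':');
    const [seconds, millis] = rest.split(',');
    return ((Number(hours) * 60 + Number(minutes)) * 60 + Number(seconds)) * 1000 + Number(millis);
}

// Sum the duration of every cue found in the SRT text.
function totalCueDurationMs(srtText) {
    const timing = /(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})/g;
    let total = 0;
    for (const match of srtText.matchAll(timing)) {
        total += srtTimeToMs(match[2]) - srtTimeToMs(match[1]);
    }
    return total;
}

// Invented sample with two cues (2500 ms and 1500 ms).
const sampleSrt = [
    '1',
    '00:00:01,000 --> 00:00:03,500',
    'Hello there.',
    '',
    '2',
    '00:01:59,800 --> 00:02:01,300',
    'Where are we going?'
].join('\n');

console.log(totalCueDurationMs(sampleSrt)); // 4000
```

Converting to milliseconds first avoids negative per-component differences when a cue crosses a second or minute boundary, as the second sample cue does.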

Credits

The storyline, characters and images represented in this repo belong to their respective owners.

Links | Description
DISNEY | All content belongs to Disney
IMSDB | Scripts from the movies
YIFY | Subtitle files
MARKDOWN TO HTML | Convert Markdown to HTML
JSON PARSER | Parse the JSON
CSV-JSON | Convert JSON to CSV
WORDCOUNTER | Count words if needed
REGEXR | For regular expressions
WORDPOS | Part-of-speech tagging for the types of words

data-scripts-star-wars's People

Contributors

jcwieme
