Star-Trek-Scripts-NLP-Playground

Project as a training ground for different aspects of NLP

This is a toy project for practicing different aspects of NLP, from collecting and preprocessing data to search and question answering. I was especially interested in trying out the Haystack library. You can use the provided Python scripts to create your own corpus of Memory Alpha plot summaries and then query it. So far, I have focussed more on collecting and cleaning the data than on advanced NLP methods, so the available query options are limited to what the Haystack implementation provides.

Getting Started

If you want to use the code available here, the easiest way to proceed is as follows:

  • Use the requirements.txt to set up your virtual environment with the necessary libraries.
  • Download the Memory Alpha articles via the provided scraping script (data folder: scrape_episodes_articles.py).
  • Use the script extract_plots.py from the data folder to write the plot descriptions from the Wiki articles to text files.
  • Train the bigram model and save a corpus for word2vec via collocation_creation.py.
  • Run word2vec_creation.py to create the word2vec vector embedding model of the corpus and try it out if you want ;).
  • You can now use the combined_search_qa.py script to query the corpus with TF-IDF or DPR (dense passage retrieval) methods and ask questions. Have fun - and do not hesitate to give feedback :)
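
To give an idea of what a TF-IDF query in the last step does, here is a minimal sketch that ranks a few documents against a query. It is a standalone illustration using only the standard library. It is not Haystack's retriever and not the code in combined_search_qa.py; the function names, the smoothed IDF variant, and the three-sentence corpus are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_tfidf(query, documents):
    """Return (doc_index, score) pairs sorted by descending TF-IDF score."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = []
    for index, tokens in enumerate(tokenized):
        term_counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term_counts[term] == 0:
                continue
            tf = term_counts[term] / len(tokens)
            # Smoothed inverse document frequency: rarer terms weigh more.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            score += tf * idf
        scores.append((index, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical mini corpus standing in for the extracted plot files.
plots = [
    "The crew encounters the Borg near a remote colony.",
    "Data builds an android daughter named Lal.",
    "Picard is captured and assimilated by the Borg.",
]
ranking = rank_tfidf("Borg assimilated", plots)
best_index = ranking[0][0]
print(plots[best_index])
```

The third plot ranks first because it contains both query terms, and "assimilated" occurs in only one document, so it carries a higher weight than "Borg". DPR, by contrast, compares learned dense vectors instead of term counts and is not covered by this sketch.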

Project structure

The files of this project are organized into the following folders:

  • Data
  • Preprocessing

Data

In this folder, you can find the raw data as well as artifacts generated in the different steps of the project, such as serialized DataFrames and collocation models. Scripts for collecting additional data, such as scraping episode articles from the Memory Alpha wiki, can be found here as well.
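
The data folder also holds the plot extraction step mentioned under Getting Started. As an illustration of that kind of step, the following minimal sketch pulls the paragraphs under one section heading out of an article's HTML. It is not the code in extract_plots.py: the class and function names, the assumption that the plot sits under an h2 heading that starts with "Summary", and the sample HTML are all hypothetical, and the sketch expects reasonably well-formed markup.

```python
from html.parser import HTMLParser


class SectionTextExtractor(HTMLParser):
    """Collect paragraph text between a given <h2> heading and the next <h2>."""

    def __init__(self, section_title):
        super().__init__()
        self.section_title = section_title.lower()
        self.in_heading = False
        self.in_section = False
        self.in_paragraph = False
        self.heading_text = []
        self.paragraph_text = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_heading = True
            self.heading_text = []
        elif tag == "p" and self.in_section:
            self.in_paragraph = True
            self.paragraph_text = []

    def handle_endtag(self, tag):
        if tag == "h2" and self.in_heading:
            self.in_heading = False
            heading = "".join(self.heading_text).strip().lower()
            # Enter the wanted section, or leave it at the next <h2>.
            # startswith tolerates trailing text such as an edit link.
            self.in_section = heading.startswith(self.section_title)
        elif tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            # Join the text pieces and normalize whitespace.
            text = " ".join("".join(self.paragraph_text).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self.in_heading:
            self.heading_text.append(data)
        elif self.in_paragraph:
            self.paragraph_text.append(data)


def extract_section(html, section_title="Summary"):
    """Return the paragraphs of one section, separated by blank lines."""
    parser = SectionTextExtractor(section_title)
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


# Hypothetical article snippet; real pages contain far more markup.
sample_html = """
<h1>Example episode</h1>
<p>Intro paragraph that is not part of the plot.</p>
<h2>Summary</h2>
<p>The crew answers a <a href="#">distress call</a>.</p>
<p>They rescue   the colonists.</p>
<h2>Background information</h2>
<p>Production notes.</p>
"""
plot_text = extract_section(sample_html)
print(plot_text)
```

Only the two paragraphs between the "Summary" heading and the next h2 heading are kept; link markup is dropped and whitespace is normalized.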

Preprocessing

In this folder, you can find scripts that preprocess the data, either by extracting the text from the original files or by cleaning up the text. Modules that create collocations or word2vec embeddings also reside in this folder.
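
To show what creating collocations means in practice, here is a minimal sketch that detects frequent word pairs and merges them into single tokens before the text is passed on to an embedding model. It is not the code in collocation_creation.py: the function names, the count-based score, the parameter values, and the toy sentences are assumptions chosen for the example.

```python
from collections import Counter


def find_collocations(sentences, min_count=2, threshold=1.0):
    """Score adjacent word pairs and return those above the threshold."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)
    collocations = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        # Pairs that co-occur more often than the frequencies of their
        # individual words suggest receive a high score.
        score = (count - min_count) * vocab_size / (
            unigrams[first] * unigrams[second]
        )
        if score > threshold:
            collocations[(first, second)] = score
    return collocations


def merge_collocations(tokens, collocations):
    """Join detected pairs into single tokens such as 'warp_core'."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in collocations:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical tokenized sentences standing in for the real corpus.
sentences = [
    ["the", "warp", "core", "is", "failing"],
    ["eject", "the", "warp", "core", "now"],
    ["the", "warp", "core", "breach", "is", "imminent"],
    ["the", "captain", "is", "on", "the", "bridge"],
]
collocations = find_collocations(sentences)
merged = merge_collocations(sentences[0], collocations)
print(merged)
```

In this toy corpus, "the warp" occurs as often as "warp core", but "the" is so frequent on its own that only "warp core" passes the threshold.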

Contact and contribution

Should you want to contribute or to contact me, feel free to send me an email. Live long and prosper!

Data collection and cleaning

This playground currently uses the following data:

  • A dataset with the scripts of the episodes, taken from Kaggle (https://www.kaggle.com/gjbroughton/start-trek-scripts). For this part of the data, functions are available that extract the episode title from the scripts and clean up the text (for example, by removing speakers).

  • A dataset of plot descriptions that I scraped from the Memory Alpha wiki articles. The plot descriptions of the episodes were extracted with a simple parsing script. The scripts for downloading the articles and extracting the plot descriptions can be found in the data folder.
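
For the script dataset in the first item above, cleaning up the text could look roughly like the following sketch, which strips speaker labels and stage directions. It is not the project's cleaning function: the assumed line format (an upper-case speaker label followed by a colon, with stage directions in parentheses or square brackets), the function name, and the sample text are all hypothetical and may need adjusting to the actual dataset.

```python
import re

# Assumed speaker format: an upper-case label, optionally followed by a
# bracketed note, then a colon, e.g. "RIKER [OC]: Bridge to captain."
SPEAKER_PREFIX = re.compile(
    r"^\s*[A-Z][A-Z0-9' .\-]*(?:\s*\[[^\]]*\])?\s*:\s*"
)
# Assumed stage-direction format: text in parentheses or square brackets.
STAGE_DIRECTION = re.compile(r"\([^)]*\)|\[[^\]]*\]")


def clean_script(raw_script):
    """Remove speaker labels and stage directions, keeping only dialogue."""
    cleaned_lines = []
    for line in raw_script.splitlines():
        line = SPEAKER_PREFIX.sub("", line)
        line = STAGE_DIRECTION.sub("", line)
        # Collapse leftover whitespace and skip lines that became empty.
        line = " ".join(line.split())
        if line:
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


# Hypothetical excerpt in the assumed format.
raw = "\n".join([
    "[Bridge]",
    "PICARD: Make it so.",
    "(The doors open.)",
    "RIKER [OC]: Bridge to captain.",
    "DATA: Intriguing. (He tilts his head.)",
])
cleaned = clean_script(raw)
print(cleaned)
```

Lines that consist only of a location or stage direction disappear entirely, and dialogue that starts with a mixed-case word followed by a colon is left untouched because the label pattern only accepts upper-case letters.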
