

nlp-project

This repo contains some files that weren't actually used for the final product. We will highlight the important ones.

Scripts used to collect and pre-process data are found in data-scripts. Of importance:

  • data-scripts/scrape_pbp_to_headline.py: used to collect the data in data/rich-pbp-to-headline/raw. With slight modifications, it can also be used to collect the data in data/pbp-to-headline/raw (I was dumb and modified the script in-place instead of creating a copy, so the original version is not in the repo)
  • data-scripts/scrape_mgr.py: launches a WorkQueue manager that creates tasks to collect data by month. Used for parallel data collection
  • data-scripts/tokenize_pbp.py: tokenized all pbp data (produced data/pbp-to-headline/tokenized and data/rich-pbp-to-headline/tokenized)
  • data-scripts/split_pbp_data.py: splits data deterministically into train (~75%), dev (~10%), and test (~15%). Used to produce the dev and test files in data/{rich-,}pbp-to-headline/tokenized (train is not included in the repo because it exceeds GitHub's file size limit).
  • data-scripts/{bbref_scrape_comment.py,html_scrape_day.py}: collected data for the original stat-csv-to-full-recap concept that was abandoned. Included for completeness and posterity.
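For readers who want to reproduce a deterministic ~75/10/15 split like the one described above, here is a minimal sketch. It is not the code in split_pbp_data.py: the hashing approach, the function names assign_split and split_pairs, and the idea of a per-example key (for instance a game identifier) are all assumptions made for illustration.

```python
import hashlib


def assign_split(key: str, train: float = 0.75, dev: float = 0.10) -> str:
    """Map a stable example key to 'train', 'dev', or 'test'.

    Hypothetical helper: hashing the key means the same example always
    lands in the same split, regardless of input order or random seeds.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Scale the first 8 bytes of the hash to a roughly uniform value in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + dev:
        return "dev"
    return "test"


def split_pairs(pairs):
    """Group (key, source, target) triples into the three splits."""
    splits = {"train": [], "dev": [], "test": []}
    for key, source, target in pairs:
        splits[assign_split(key)].append((source, target))
    return splits
```

Because the assignment depends only on the key, re-running the split after collecting more data never moves an existing example between train, dev, and test. The proportions are approximate, which matches the "~" in the percentages above.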

Python programs used to do NLP stuff are found in nlp-scripts. Of importance:

  • baseline.py: the baseline model
  • nlp-scripts/hw2_transformer.py: trained the basic transformer
  • nlp-scripts/hw2_transformer_v2.py: trained the expanded and rich play-by-play transformers

Data included in repo:

  • data/pbp-to-headline: play-by-play to headline data collected for training the basic and expanded transformers, and fed to the baseline to compute results. Includes raw collected data under raw and tokenized data under tokenized. The latter also includes the dev and test data used. Train data is not included because the file size exceeded 100 MB. The entire collection of tokenized data is 190 MB and contains 43689 sentence pairs. Structured by year, then by team, then by month, then by day. Includes the scraper's error logs so you can see which days were not collected successfully.
  • data/rich-pbp-to-headline: rich play-by-play to headline data collected for training the rich play-by-play transformer. Includes raw collected data under raw and tokenized data under tokenized. The latter also includes the dev and test data used. Train data is not included because the file size exceeded 100 MB. The entire collection of tokenized data is 182 MB and contains 40240 sentence pairs. Note that this is slightly smaller because it excludes all postseason data. Structured by year, then by month. Includes the scraper's error logs so you can see which days were not collected successfully.
  • data/csv-to-recap: statistical csv to full recap data originally collected but later abandoned. Raw and tokenized data both included. Structured by month. Included for completeness and posterity.
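If you want to walk one of these nested data directories, something like the following sketch may help. It assumes every level named above (year, team, month, day) is a directory, which is a guess about the layout rather than something stated here; leaf_dirs is a hypothetical helper, not part of the repo.

```python
from pathlib import Path


def leaf_dirs(root, depth):
    """Yield the directories exactly `depth` levels below `root`.

    Non-directory entries (for example log files) are skipped, and
    children are visited in sorted order so the output is stable.
    """
    root = Path(root)
    if depth == 0:
        yield root
        return
    for child in sorted(root.iterdir()):
        if child.is_dir():
            yield from leaf_dirs(child, depth - 1)
```

Under that assumption, a depth of 4 would correspond to year/team/month/day and a depth of 2 to year/month. If the last level turns out to be stored as files rather than directories, use a depth one smaller and list the files inside each result.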

Trained models, along with their training output and test headlines, are included under models/.

Contributors

tgfisher4
