
nlp-capstone's Introduction

nlp-capstone

Setup

This project targets Python 3.5 specifically (at the time of writing, Tensorflow isn't yet compatible with Python 3.6+).

Installation:

python3.5 -m pip install -r requirements.txt

Note: when running code, you should be within the abuse folder.

Useful utilities:

To run the command line tool:

python3.5 cmd.py [dataset-name] [dataset-params] [model-name] [model-params]

The params will be forwarded directly as arguments to the dataset and the model respectively.

The params must be in the form:

--param_name arg

The types of the args are automatically inferred.

So for example, to run the RNN model using the wikipedia dataset (specifically, the toxicity dataset), setting the number of epochs to 7 and leaving all other params at their defaults, you would run:

python3.5 cmd.py wikipedia --category toxicity rnn --epoch 7
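As a rough illustration of how `--param_name arg` pairs could be parsed with inferred types, here is a minimal sketch. It is not the code in cmd.py: the names `infer_type` and `parse_params` are hypothetical, and using `ast.literal_eval` for the inference is an assumption about one reasonable approach.

```python
import ast
from typing import Any, Dict, List


def infer_type(raw: str) -> Any:
    """Interpret a raw string as a Python literal, falling back to str.

    Hypothetical helper: "7" becomes the int 7, "0.5" a float, and
    anything that is not a literal (e.g. "toxicity") stays a string.
    """
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def parse_params(tokens: List[str]) -> Dict[str, Any]:
    """Turn ['--epoch', '7'] into {'epoch': 7}."""
    if len(tokens) % 2 != 0:
        raise ValueError("expected '--param_name arg' pairs")
    params = {}
    # Walk the tokens two at a time: a flag followed by its argument.
    for flag, raw in zip(tokens[0::2], tokens[1::2]):
        if not flag.startswith("--"):
            raise ValueError("expected a flag, got {!r}".format(flag))
        params[flag[2:]] = infer_type(raw)
    return params
```

With this sketch, `parse_params(["--category", "toxicity"])` keeps `toxicity` as a string, while `parse_params(["--epoch", "7"])` yields an integer.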

Regenerating json data caches

To regenerate the cached data json files for a particular data type:

python3.5 -m data_extraction.[dataset_name].parsing

For example, to regenerate the wikipedia data, run:

python3.5 -m data_extraction.wikipedia.parsing

Typechecking

Run:

mypy [path-to-file.py]

To typecheck the entire project, run:

mypy ../abuse

...which is a bit of a hack, but whatever.

nlp-capstone's People

Contributors

briankchan, michael0x2a


nlp-capstone's Issues

Plan what questions we want to ask Yejin during office hours Tuesday

Some possible questions:

  • Does it feel like we're making good progress/that we're focused enough?
  • Suggestions on more things we can analyze? Right now, we're sort of exploring a bunch of things without necessarily having a clear end goal.

Do we also have any technical questions we can ask?

Implement model saving and loading

Implement code to save and load trained model. Probably dependent on #3

  • Bag-of-words model
  • Logistic regression model
  • RNN model

Tasks 2 and 3 can probably be combined, since they both use tensorflow?

Get access to GPU instances on Azure

  • Make account + project on Azure
  • Email Hannah account email to get credits
  • Provision instance and install CUDA + other Nvidia stuff + Tensorflow

This probably isn't super crucial -- I think we both have access to GPUs atm, so we can afford to put off at least the installation phase for a few days, since that's tedious and finicky.

Question: are the annotators biased?

For the sake of due diligence, we should check if the annotators for the wikipedia dataset are noticeably biased in some way.

This data may already be in the wikipedia paper; if not, we may need to collect it ourselves.

We should probably make sure to have a short blurb about this in our final report.

Evaluate how quickly classifiers can predict new input

This is more of an engineering concern, but if this were to become an actual product, it'd probably be useful to make sure the classifier is performant/can handle a reasonable load while running on a variety of different hardware.

It might be worth conducting experiments/explicitly trying to optimize the speed of our classifiers/other tensorflow stuff?
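One simple way to start measuring this is to time a prediction function over a fixed set of comments. The sketch below assumes a hypothetical `predict` callable that takes a batch of comment strings; neither the function name `time_predictions` nor the batch size comes from the project.

```python
import time
from typing import Callable, List, Sequence


def time_predictions(predict: Callable[[Sequence[str]], Sequence[float]],
                     comments: List[str],
                     batch_size: int = 32) -> float:
    """Return the average wall-clock seconds spent per comment.

    `predict` is a stand-in for whatever a trained model exposes.
    """
    if not comments:
        raise ValueError("need at least one comment to time")
    start = time.perf_counter()
    # Feed the comments through in fixed-size batches.
    for i in range(0, len(comments), batch_size):
        predict(comments[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return elapsed / len(comments)
```

Running this with several batch sizes and on different machines would give a first picture of throughput, though a real load test would also need to account for model loading and warm-up time.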

Improve RNN performance

Currently, the RNN model is hovering around 0.8 to 0.83 AUC score despite doing a bunch of tweaking. Things to try:

  • Scramble data before each epoch
  • Try porting RNN to a character RNN
  • Try using a static RNN instead of a bidirectional RNN
  • Vary different parameters by hand
  • Try stacking multiple RNNs
  • Try using variants of different dynamic RNNs
  • Try using an actual grid search thing. It'll probably take hours to run, but whatever.
  • Try using optimizers other than Adam (SGD? Momentum? AdaDelta?)
  • Try varying learning rate
  • Dig into misclassified results to get better insights into what the problem is
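As an illustration of the first item, scrambling the data before each epoch could look roughly like this. It is a sketch using only the standard library; `shuffled_batches` is a hypothetical name, and the real training loop presumably feeds Tensorflow rather than plain lists.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def shuffled_batches(xs: Sequence, ys: Sequence, batch_size: int,
                     rng: random.Random) -> Iterator[Tuple[List, List]]:
    """Yield (inputs, labels) batches in a fresh random order.

    Calling this once per epoch gives every epoch a different ordering
    while keeping each input paired with its own label.
    """
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield [xs[i] for i in chunk], [ys[i] for i in chunk]
```

A training loop would then call `shuffled_batches(...)` at the top of each epoch instead of iterating over the data in its stored order.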

Implement a basic web interface for demo purposes

This web interface should...

  • Accept arbitrary text as input
  • Report the output of our models on different metrics
  • Make sure we can run multiple tensorflow models

It'd be cool if we can have something basic working for our Monday demo, then add more features later for our final demo.

Write code to load other datasets

  • Load wikipedia toxicity dataset
  • Load wikipedia aggressiveness dataset
  • Figure out if all of the comments in all datasets are the "same", and see if we can't store them all in the 'comment' object.
  • Cache loaded data from each dataset

Refactor models so they all follow the same interface

We'll probably follow approximately the same interface as the character n-gram model.

  • Refactor RNN model
  • Refactor bag-of-words model
  • Refactor character n-gram model
  • Make base class or abc to help code typecheck
  • Make sure all models can accept multiple classes?
  • Strip out all dataset-specific code
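For the base class item, a shared interface might be sketched as follows. The class name `Model`, the method names `train` and `predict`, and the toy subclass are all hypothetical; the project's actual interface follows the character n-gram model, which is not shown here.

```python
from abc import ABC, abstractmethod
from typing import List


class Model(ABC):
    """Hypothetical common interface for all classifiers."""

    @abstractmethod
    def train(self, comments: List[str], labels: List[int]) -> None:
        """Fit the model to labelled comments."""

    @abstractmethod
    def predict(self, comments: List[str]) -> List[float]:
        """Return one score per comment."""


class PositiveRateModel(Model):
    """Toy subclass: always predicts the positive rate seen in training."""

    def __init__(self) -> None:
        self.positive_rate = 0.0

    def train(self, comments: List[str], labels: List[int]) -> None:
        self.positive_rate = sum(labels) / len(labels) if labels else 0.0

    def predict(self, comments: List[str]) -> List[float]:
        return [self.positive_rate for _ in comments]
```

Because every method is annotated, mypy can check any code written against `Model`, and instantiating a subclass that forgets to implement a method fails with a `TypeError`.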

Fix padding issues with RNN model

Ideas to explore:

  • Apply a mask before computing loss
  • Look into using dynamically-resizing RNNs, and having the width vary per batch

Potential issues:

  • Comment width can vary widely; how do we handle outliers?
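To make the masking idea concrete, here is a small framework-free sketch. It assumes the loss is available per timestep for each padded comment, which may not match how the RNN model actually computes its loss. `masked_mean_loss` is a hypothetical name.

```python
from typing import List


def masked_mean_loss(step_losses: List[List[float]],
                     lengths: List[int]) -> float:
    """Average per-timestep losses, ignoring padded positions.

    step_losses[i] holds the losses for comment i, padded to a common
    width; lengths[i] is the number of real (unpadded) timesteps.
    """
    total = 0.0
    count = 0
    for losses, length in zip(step_losses, lengths):
        real = min(length, len(losses))
        # Only the first `real` positions carry genuine tokens.
        total += sum(losses[:real])
        count += real
    if count == 0:
        raise ValueError("no unpadded timesteps to average over")
    return total / count
```

In Tensorflow the same effect is usually achieved by multiplying the loss tensor by a 0/1 mask built from the sequence lengths before reducing it.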

Write command line interface to let us run models

The command line tool should let us...

  • Pick which model we want to use
  • Pick which dataset we want to analyze
  • Provide model-specific options
  • Choose between re-training or using an already-trained model
  • Choose whether to clear any caches/logs or reuse them (for tensorflow)
  • Choose the output location of any data/logs

Probably dependent on #3.
