michael0x2a / nlp-capstone

Python 97.47% JavaScript 0.94% CSS 0.30% HTML 1.29%

nlp-capstone's Introduction

nlp-capstone

Setup

This project is designed to use Python 3.5 specifically (TensorFlow isn't yet compatible with Python 3.6+).

Installation:

python3.5 -m pip install -r requirements.txt

Note: when running code, you should be within the abuse folder.

Useful utilities:

To run the command line tool, run:

python3.5 cmd.py [dataset-name] [dataset-params] [model-name] [model-params]

The params are forwarded directly as arguments to the named dataset and the named model, respectively.

The params must be in the form:

--param_name arg

The types of the args are automatically inferred.

So for example, to run the RNN model on the Wikipedia dataset (specifically, the toxicity dataset), setting the number of epochs to 7 and leaving all other params at their defaults, you would run:

python3.5 cmd.py wikipedia --category toxicity rnn --epoch 7
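
As a rough illustration of how "--param_name arg" pairs could be turned into typed keyword arguments, here is a minimal sketch. It assumes the inference is done with ast.literal_eval and falls back to a plain string; the function names infer_type and parse_params are hypothetical and may not match what cmd.py actually does.

```python
import ast
from typing import Any, Dict, List


def infer_type(raw: str) -> Any:
    """Interpret a raw command line string as a Python literal when possible."""
    try:
        # Handles ints, floats, booleans, None, and quoted strings.
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Anything that is not a valid literal stays a plain string.
        return raw


def parse_params(tokens: List[str]) -> Dict[str, Any]:
    """Turn ['--epoch', '7'] into {'epoch': 7}."""
    if len(tokens) % 2 != 0:
        raise ValueError('params must come in "--param_name arg" pairs')
    params = {}  # type: Dict[str, Any]
    for flag, raw in zip(tokens[::2], tokens[1::2]):
        if not flag.startswith('--'):
            raise ValueError('expected a flag starting with "--", got ' + repr(flag))
        params[flag[2:]] = infer_type(raw)
    return params


print(parse_params(['--category', 'toxicity']))
print(parse_params(['--epoch', '7']))
```

With this approach "7" becomes the integer 7, while "toxicity" is not a valid literal and so stays a string.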

Regenerating JSON data caches

To regenerate the cached JSON data files for a particular dataset:

python3.5 -m data_extraction.[dataset_name].parsing

For example, to regenerate the wikipedia data, run:

python3.5 -m data_extraction.wikipedia.parsing

Typechecking

Run:

mypy [path-to-file.py]

To typecheck the entire project, run:

mypy ../abuse

...which is a bit of a hack, but whatever.

nlp-capstone's Issues

Implement checkpoints (or something) for models

So we can train, test, then continue re-training.

Plan what questions we want to ask Yejin during office hours Tuesday

Some possible questions:

Does it feel like we're making good progress/that we're focused enough?
Suggestions on more things we can analyze? Right now, we're sort of exploring a bunch of things without necessarily having a clear end goal.

Do we also have any technical questions we can ask?

Train on proportion of people, rather than hard binary labels

Might require some refactoring.
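
One way to picture this: instead of collapsing the annotator votes into 0 or 1, use the fraction of annotators who flagged the comment as the training target. The sketch below assumes binary votes per annotator and a binary cross-entropy loss; soft_label and cross_entropy are hypothetical names, not functions from this repo.

```python
import math
from typing import List


def soft_label(votes: List[int]) -> float:
    """Fraction of annotators who flagged the comment (votes are 0 or 1)."""
    if not votes:
        raise ValueError('need at least one annotation')
    return sum(votes) / len(votes)


def cross_entropy(target: float, predicted: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy that accepts a target anywhere in [0, 1]."""
    # Clamp the prediction so log() never sees exactly 0 or 1.
    p = min(max(predicted, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


votes = [1, 1, 0, 0, 0]          # hypothetical: 2 of 5 annotators said "toxic"
target = soft_label(votes)       # 0.4 instead of a hard label of 0
print(cross_entropy(target, 0.4))
print(cross_entropy(target, 0.9))
```

The loss is smallest when the predicted probability matches the annotator proportion, so a confident prediction of 0.9 is penalized more than 0.4 here.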

Make slides for demo 1

Identify specific phrases that are attacks, aggressive, toxic

Approach as reading comprehension problem?

Implement model saving and loading

Implement code to save and load trained model. Probably dependent on #3
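
A framework-agnostic sketch of what saving and loading could look like, assuming the model's state can be expressed as JSON-serializable parameters plus an epoch counter (so training can resume later). The TensorFlow models would more likely use TensorFlow's own saving facilities; save_checkpoint and load_checkpoint are hypothetical names.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_checkpoint(path: str, params: Dict[str, Any], epoch: int) -> None:
    """Write model parameters and training progress to a JSON file."""
    tmp_path = path + '.tmp'
    with open(tmp_path, 'w') as f:
        json.dump({'epoch': epoch, 'params': params}, f)
    # Rename last so a crash mid-write never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Read back a checkpoint written by save_checkpoint."""
    with open(path) as f:
        return json.load(f)


checkpoint_path = os.path.join(tempfile.mkdtemp(), 'model.json')
save_checkpoint(checkpoint_path, {'weights': [0.1, -0.2], 'bias': 0.05}, epoch=3)
state = load_checkpoint(checkpoint_path)
print(state['epoch'], state['params'])
```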

Bag-of-words model
Logistic regression model
RNN model

Tasks 2 and 3 can probably be combined, since they both use tensorflow?

Get access to GPU instances on Azure

Make account + project on Azure
Email Hannah account email to get credits
Provision instance and install CUDA + other Nvidia stuff + Tensorflow

This probably isn't super crucial -- I think we both have access to GPUs atm, so we can afford to put off at least the installation phase for a few days, since that's tedious and finicky.

Question: are the annotators biased?

For the sake of due diligence, we should check if the annotators for the Wikipedia dataset are noticeably biased in some way.

This data may already be in the Wikipedia paper; if not, we may need to collect it ourselves.

We should probably make sure to have a short blurb about this in our final report.

Refactor models so they return probabilities rather than hard labels

Evaluate how quickly classifiers can predict new input

This is more of an engineering concern, but if this were to become an actual product, it'd probably be useful to make sure the classifier is performant/can handle a reasonable load while running on a variety of different hardware.

It might be worth conducting experiments/explicitly trying to optimize the speed of our classifiers/other tensorflow stuff?
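
A minimal timing harness for this kind of experiment might look like the following. It assumes a classifier can be wrapped as a function from a list of comments to a list of probabilities; dummy_predict is a hypothetical stand-in for a trained model, and comments_per_second is an illustrative name.

```python
import time
from typing import Callable, List


def comments_per_second(predict: Callable[[List[str]], List[float]],
                        comments: List[str],
                        repeats: int = 5) -> float:
    """Return the best observed throughput of predict over several runs."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        predict(comments)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a very fast call on a coarse clock.
    return len(comments) / max(best, 1e-9)


def dummy_predict(comments: List[str]) -> List[float]:
    """Hypothetical stand-in for a trained classifier."""
    return [min(1.0, len(c) / 100.0) for c in comments]


sample = ['you are wonderful', 'this is a test comment'] * 500
print('%.0f comments/sec' % comments_per_second(dummy_predict, sample))
```

Taking the best of several runs reduces noise from warm-up and other processes; running the same harness on different machines would give a rough hardware comparison.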

Improve RNN performance

Currently, the RNN model is hovering around 0.8 to 0.83 AUC score despite doing a bunch of tweaking. Things to try:

Modify cmd so it works on multiclass input/output

Add code quality/typechecking code (and document usage)

Not a huge priority, but it'd be cool to add tooling that...

Checks for PEP 8 compliance
Runs mypy (to typecheck code)

We should also:

Document usage in readme

Implement a basic web interface for demo purposes

This web interface should...

Accept arbitrary text as input
Report the output of our models on different metrics
Make sure we can run multiple tensorflow models

It'd be cool if we can have something basic working for our Monday demo, then add more features later for our final demo.

Write code to load other datasets

Load wikipedia toxicity dataset
Load wikipedia aggressiveness dataset
Figure out if all of the comments in all datasets are the "same", and see if we can't store them all in the 'comment' object.
Cache loaded data from each dataset
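
The caching step could be sketched roughly as below: parse the raw data once, write the result to a JSON file, and read that file on later runs. This assumes each comment can be represented as a JSON-serializable dict; load_cached and parse_toy_dataset are hypothetical names, not the repo's actual loaders.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List

Comment = Dict[str, object]


def load_cached(cache_path: str,
                parse: Callable[[], List[Comment]],
                force: bool = False) -> List[Comment]:
    """Return cached comments, re-parsing the raw data only when needed."""
    if not force and os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    comments = parse()
    with open(cache_path, 'w') as f:
        json.dump(comments, f)
    return comments


calls = []  # type: List[int]


def parse_toy_dataset() -> List[Comment]:
    """Hypothetical parser; a real one would read the raw dataset files."""
    calls.append(1)
    return [{'text': 'hello there', 'toxicity': 0.0}]


cache_file = os.path.join(tempfile.mkdtemp(), 'toy.json')
first = load_cached(cache_file, parse_toy_dataset)    # parses and writes cache
second = load_cached(cache_file, parse_toy_dataset)   # served from the cache
print(first == second, len(calls))
```

Passing force=True corresponds to regenerating the cache from the raw data.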

Rewrite aggressive text to be less/more aggressive/toxic

Treat as machine translation problem?

It'd be cool if we could have some sort of slider to scale the level of "aggressiveness/toxicity"?

TODO:

Research approach
Add more TODOs

Refactor models so they all follow the same interface

We'll probably follow approximately the same interface as the character n-gram model.

Refactor RNN model
Refactor bag-of-words model
Refactor character n-gram model
Make base class or abc to help code typecheck
Make sure all models can accept multiple classes?
Strip out all dataset-specific code
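
A shared interface along these lines could be expressed as an abstract base class, which also gives mypy something to check against. The sketch below is only an assumption about the shape of that interface: the Model class, its train/predict method names, and the MeanBaseline toy model are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List


class Model(ABC):
    """Hypothetical shared interface; method names are illustrative."""

    @abstractmethod
    def train(self, xs: List[str], ys: List[float]) -> None:
        """Fit the model to comments xs with labels ys."""

    @abstractmethod
    def predict(self, xs: List[str]) -> List[float]:
        """Return one probability per comment rather than a hard label."""


class MeanBaseline(Model):
    """Toy model that always predicts the mean training label."""

    def __init__(self) -> None:
        self.mean = 0.5

    def train(self, xs: List[str], ys: List[float]) -> None:
        self.mean = sum(ys) / len(ys)

    def predict(self, xs: List[str]) -> List[float]:
        return [self.mean for _ in xs]


model = MeanBaseline()
model.train(['a', 'b', 'c', 'd'], [1.0, 0.0, 0.0, 0.0])
print(model.predict(['some new comment']))
```

Because Model is abstract, a subclass that forgets to implement train or predict fails at instantiation time instead of deep inside a training run.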

Get character n-gram model working

Fix padding issues with RNN model

Ideas to explore:

Apply a mask before computing loss
Look into using dynamically-resizing RNNs, and having the width vary per batch

Potential issues:

Comment width can vary widely; how do we handle outliers?
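
To make the ideas above concrete, here is a small pure-Python sketch of per-batch padding with a mask. It assumes comments are already lists of integer token ids, that id 0 is reserved for padding, that outliers are handled by truncating at a fixed cap, and that the loss is computed per token; none of this is taken from the actual RNN code.

```python
from typing import List, Tuple

PAD = 0  # assumed id reserved for padding


def pad_batch(batch: List[List[int]],
              max_len: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad (or truncate) each sequence to the batch width and build a mask."""
    # Width varies per batch, but is capped so one outlier cannot blow it up.
    width = min(max(len(seq) for seq in batch), max_len)
    padded, mask = [], []
    for seq in batch:
        seq = seq[:width]
        pad = width - len(seq)
        padded.append(seq + [PAD] * pad)
        mask.append([1] * len(seq) + [0] * pad)
    return padded, mask


def masked_mean_loss(losses: List[List[float]], mask: List[List[int]]) -> float:
    """Average per-token losses over real tokens only, ignoring padding."""
    total = sum(l * m for row_l, row_m in zip(losses, mask)
                for l, m in zip(row_l, row_m))
    count = sum(sum(row) for row in mask)
    return total / count


padded, mask = pad_batch([[5, 6, 7], [8], [1, 2, 3, 4, 5, 6]], max_len=4)
print(padded)
print(mask)
```

Multiplying by the mask before averaging keeps padded positions from contributing to the loss, and sizing each batch to its own longest (capped) comment avoids padding every batch to the global maximum.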

Write command line interface to let us run models

The command line tool should let us...

Pick which model we want to use
Pick which dataset we want to analyze
Provide model-specific options
Choose between re-training or using an already-trained model
Choose whether to clear any caches/logs or reuse them (for TensorFlow)
Choose the output location of any data/logs

Probably dependent on #3.
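
As one possible starting point, the options above map fairly directly onto argparse. This is a sketch only: the flag names (--retrain, --clear-cache, --output-dir) and the argument layout are assumptions for illustration and do not necessarily match the layout cmd.py ended up using.

```python
import argparse
from typing import List, Optional, Tuple


def build_parser() -> argparse.ArgumentParser:
    """Build a parser covering the options listed above (names are illustrative)."""
    parser = argparse.ArgumentParser(description='Run a model on a dataset.')
    parser.add_argument('dataset', help='which dataset to analyze')
    parser.add_argument('model', help='which model to use')
    parser.add_argument('--retrain', action='store_true',
                        help='train from scratch instead of loading a saved model')
    parser.add_argument('--clear-cache', action='store_true',
                        help='delete cached data and logs before running')
    parser.add_argument('--output-dir', default='runs',
                        help='where to write data and logs')
    return parser


def parse(argv: Optional[List[str]] = None) -> Tuple[argparse.Namespace, List[str]]:
    # parse_known_args returns unrecognized flags separately, so
    # model-specific options can be forwarded to the model untouched.
    return build_parser().parse_known_args(argv)


args, extra = parse(['wikipedia', 'rnn', '--retrain', '--epoch', '7'])
print(args.dataset, args.model, args.retrain, args.output_dir, extra)
```

Using parse_known_args keeps the tool-level flags in one place while leaving model-specific options to be interpreted by whichever model was picked.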
