toxic_comments's Introduction

Detecting Toxic Comments

Introduction

CNN and LSTM models for text classification. The models are tested on a multi-label classification task using the Wikimedia comments dataset. With randomly initialized word embeddings, the model achieved an AUROC of 0.896; with FastText embeddings, the AUROC is 0.972 for Yoon Kim's CNN and 0.983 for a stacked LSTM with attention.

Usage

Training

To train with the default layer configuration:

python training/train.py --data dataset.csv --vocab 30000 --embedding 300 --mode cnn

The vocab flag sets the vocabulary size and the embedding flag sets the embedding size. In this example the actual vocabulary size is 30002, because tokens for unknown words and padding are added. There are four modes: 'cnn' trains a CNN classifier, 'lstm' trains an LSTM classifier, 'emb' trains word embeddings, and 'test' tests a trained model.
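The jump from 30000 to 30002 can be illustrated with a minimal sketch of vocabulary construction. The token names `<pad>` and `<unk>`, the id order, and the helper names `build_vocab` and `encode` are assumptions for illustration; the project's own preprocessing may differ.

```python
from collections import Counter

# Hypothetical special tokens; the project's actual names and ids may differ.
PAD_TOKEN = "<pad>"
UNK_TOKEN = "<unk>"


def build_vocab(tokenized_comments, vocab_size):
    """Map the `vocab_size` most frequent words to ids, after two reserved ids."""
    counts = Counter(word for comment in tokenized_comments for word in comment)
    word2id = {PAD_TOKEN: 0, UNK_TOKEN: 1}
    for word, _ in counts.most_common(vocab_size):
        word2id[word] = len(word2id)
    return word2id


def encode(comment, word2id):
    """Replace out-of-vocabulary words with the unknown-word id."""
    unk_id = word2id[UNK_TOKEN]
    return [word2id.get(word, unk_id) for word in comment]


comments = [["you", "are", "great"], ["you", "are", "not"], ["you", "rock"]]
word2id = build_vocab(comments, vocab_size=2)
print(len(word2id))  # 2 requested words + 2 special tokens = 4
print(encode(["you", "are", "awful"], word2id))  # [2, 3, 1]
```

With `vocab_size=30000` and a large enough corpus, the same logic yields 30002 entries.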

To train with a pretrained word vector file, use the 'vector' flag:

python training/train.py --data dataset.csv --vocab 30000 --embedding 300 --mode lstm --vector fasttext.vec
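As a rough sketch of what loading such a file involves: a FastText `.vec` file is plain text whose first line gives the word count and dimension, followed by one word and its vector per line. The function name `load_vec`, the random fallback for missing words, and the assumption that ids run from 0 to n-1 are illustrative, not taken from this project's code.

```python
import io
import random


def load_vec(handle, word2id, dim):
    """Build an embedding table from FastText's text .vec format.

    The first line holds "<word count> <dimension>"; each later line holds
    a word followed by its vector components, separated by spaces.
    """
    rng = random.Random(0)
    # Words missing from the file keep a small random vector (an assumption).
    table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in word2id]
    _, file_dim = handle.readline().split()
    if int(file_dim) != dim:
        raise ValueError("embedding size does not match the vector file")
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in word2id and len(values) == dim:
            table[word2id[word]] = [float(v) for v in values]
    return table


# A tiny in-memory stand-in for a real .vec file.
sample = io.StringIO("2 3\nyou 0.1 0.2 0.3\nare 0.4 0.5 0.6\n")
word2id = {"<pad>": 0, "<unk>": 1, "you": 2, "are": 3}
table = load_vec(sample, word2id, dim=3)
print(table[2])  # [0.1, 0.2, 0.3]
```

The dimension check mirrors the need for the embedding flag (300 in the command above) to match the vector file.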

You can optionally supply a TSV metadata file for the TensorBoard projector with the metadata flag.

Use the deployed example model trained on the Wikimedia dataset

Make requests to the deployed saved model:

python training/client.py --server 35.227.88.30:9000 -d "metadata/word2id.pickle" -t "Enter your potential abusive text here."

The response is printed in protocol buffer text format rather than JSON:

outputs {
  key: "output"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 1
      }
      dim {
        size: 6
      }
    }
    float_val: 1.0
    float_val: 0.0
    float_val: 1.0
    float_val: 0.0
    float_val: 0.0
    float_val: 0.0
  }
}

The six float_val entries correspond, in order, to the labels toxic, severe_toxic, obscene, threat, insult, and identity_hate.
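A small sketch of how the six values can be paired with their labels once they have been read out of the response. The helper names and the 0.5 threshold are assumptions for illustration; the values are the ones from the example output above.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def label_scores(float_vals):
    """Pair each output value with its label, in the documented order."""
    if len(float_vals) != len(LABELS):
        raise ValueError("expected one value per label")
    return dict(zip(LABELS, float_vals))


def flagged(float_vals, threshold=0.5):
    """Return labels whose value exceeds the threshold (0.5 is an assumption)."""
    scores = label_scores(float_vals)
    return [name for name, value in scores.items() if value > threshold]


example = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(flagged(example))  # ['toxic', 'obscene']
```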

Custom CNN layers

If you write your own training and testing code, you can change the layer configuration by setting the layer_config and fully_conn_config attributes of the ToxicityCNN object. layer_config is a list with the following structure:

[
    [
        # Parallel layer 1
        [ksize, stride, out_channels, pool_ksize, pool_stride],
    ],
    [
        # Parallel layer 2
        [ksize, stride, out_channels, pool_ksize, pool_stride],
    ],
]

For example, a configuration like this:

[
    # Convolution layer configuration
    # ksize, stride, out_channels, pool_ksize, pool_stride
    [
        [2, 1, 256, 59, 1],
    ],
    [
        [3, 1, 256, 58, 1],
    ],
    [
        [4, 1, 256, 57, 1],
    ],
    [
        [5, 1, 256, 56, 1],
    ],
]

represents a structure like this: (figure: config)
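The pool sizes in this example are consistent with max-over-time pooling on a 60-token input: with an unpadded convolution of kernel size k and stride 1, each branch produces 60 - k + 1 positions, which equals its pool_ksize. The sketch below traces the lengths under those assumptions. The input length of 60, the unpadded ("valid") windows, the reading of each inner list as layers stacked within one parallel branch, and the concatenation of branch outputs are all inferred, not stated by the project.

```python
def output_length(length, ksize, stride):
    """Length after an unpadded ("valid") 1-D convolution or pooling window."""
    return (length - ksize) // stride + 1


def trace_branch(layers, seq_len):
    """Follow one parallel branch and return (final_length, out_channels)."""
    length, channels = seq_len, None
    for ksize, stride, out_channels, pool_ksize, pool_stride in layers:
        length = output_length(length, ksize, stride)            # convolution
        length = output_length(length, pool_ksize, pool_stride)  # pooling
        channels = out_channels
    return length, channels


layer_config = [
    [[2, 1, 256, 59, 1]],
    [[3, 1, 256, 58, 1]],
    [[4, 1, 256, 57, 1]],
    [[5, 1, 256, 56, 1]],
]

SEQ_LEN = 60  # assumed input length, inferred from the pool sizes

shapes = [trace_branch(branch, SEQ_LEN) for branch in layer_config]
print(shapes)  # [(1, 256), (1, 256), (1, 256), (1, 256)]

# If the branch outputs are concatenated (an assumption), the classifier
# receives this many features per comment.
total_features = sum(length * channels for length, channels in shapes)
print(total_features)  # 1024
```

Under these assumptions each branch reduces a comment to a single 256-dimensional vector, so changing the input length requires adjusting pool_ksize to match.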

toxic_comments's People

Contributors

walter090

toxic_comments's Issues

Training on our own dataset?

Hi Walter, thanks for this amazing work. I am planning to build a multi-class intent classifier using my pretrained FastText embeddings.
What is the format of your dataset.csv, and how do you use the labeled dataset?
