The toxic_comments from walter090

Detecting Toxic Comments

Introduction

CNN and LSTM models for text classification. The model is tested on a multi-label classification task with Wikimedia comments dataset. The model achieved an AUROC of 0.896 with randomly initialized word embeddings; using FastText, the AUC is 0.972 with Kim Yoon's CNN, and 0.983 with a stacked LSTM with attention.

Usage

Training

To train with default layer configurations

python training/train.py --data dataset.csv --vocab 30000 --embedding 300 --mode cnn

where vocab flag is for specifying vocabulary size and embedding embedding size; in this example, the real vocabulary size will be 30002 since unknown word and padding word tokens are added. There are three modes: use 'cnn' for training CNN for classification, 'lstm' for training LSTM for classification, 'emb' for training word embeddings, and 'test' for testing a trained model.

To train with a pre trained word vector file, use the 'vector' flag:

python training/train.py --data dataset.csv --vocab 30000 --embedding 300 --mode lstm --vector fasttext.vec

You can also optionally add a tsv metadata file for TensorBoard projector using the metadata flag.

Use Deployed example model trained on Wikimedia dataset

Make requests to the deployed saved model:

python training/client.py --server 35.227.88.30:9000 -d "metadata/word2id.pickle" -t "Enter your potential abusive text here."

Output is a JSON file:

outputs {
  key: "output"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 1
      }
      dim {
        size: 6
      }
    }
    float_val: 1.0
    float_val: 0.0
    float_val: 1.0
    float_val: 0.0
    float_val: 0.0
    float_val: 0.0
  }
}

Each of the six float_vals represents toxic, severe_toxic, obscene, threat, insult, identity_hate.

Custom CNN layers

You can also change the layer configuration if you decide to write your own code for training and testing, by providing values to layer_config and fully_conn_config attributes to the ToxicityCNN object. layer_config is a list and follows the structure:

[
    [
        # Parellel layer 1
        [ksize, stride, out_channels, pool_ksize, pool_stride],
    ],
    [
        # Parellel layer 2
        [ksize, stride, out_channels, pool_ksize, pool_stride],
    ],
]

For Example, a configuration like this:

[
    # Convolution layer configuration
    # ksize, stride, out_channels, pool_ksize, pool_stride
    [
        [2, 1, 256, 59, 1],
    ],
    [
        [3, 1, 256, 58, 1],
    ],
    [
        [4, 1, 256, 57, 1],
    ],
    [
        [5, 1, 256, 56, 1],
    ],
]

represents a structure like this:

walter090 / toxic_comments Goto Github PK

toxic_comments's Introduction

Detecting Toxic Comments

Introduction

Usage

Training

Use Deployed example model trained on Wikimedia dataset

Custom CNN layers

toxic_comments's People

Contributors

Stargazers

Watchers

Forkers

toxic_comments's Issues

training on our dataset.?

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent