offensive_text

Data normalization

This project depends on https://pypi.org/project/emoji/ and https://pypi.org/project/clean-text/ for the preprocessing of the data.

Normalise the list of files (train and test):

python3 read_data.py \
                     --to_normalise LIST_OF_FILES \
                     --normalised_path LIST_OF_NORMALISED_PATHS_IN_ORDER \
                     --language en

Concatenate files:

python3 read_data.py \
                     --normalised_path LIST_OF_NORMALISED_PATHS \
                     --concat CONCATENATED_PATH
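
When the two preprocessing steps are driven from a script rather than typed by hand, the argument lists can be assembled programmatically. The following is a minimal sketch only: it assumes that read_data.py accepts list-valued options as space-separated values, and the file names are hypothetical placeholders. The helper functions are not part of the project.

```python
import shlex


def normalise_command(raw_files, normalised_files, language="en"):
    """Build the argv list for the normalisation step.

    Assumption: list-valued options are passed as space-separated values.
    """
    # The output paths are given "in order", so each input needs one output.
    if len(raw_files) != len(normalised_files):
        raise ValueError("each input file needs exactly one output path")
    return [
        "python3", "read_data.py",
        "--to_normalise", *raw_files,
        "--normalised_path", *normalised_files,
        "--language", language,
    ]


def concat_command(normalised_files, concatenated_path):
    """Build the argv list for the concatenation step."""
    return [
        "python3", "read_data.py",
        "--normalised_path", *normalised_files,
        "--concat", concatenated_path,
    ]


# Hypothetical file names, used for illustration only.
raw = ["train.tsv", "test.tsv"]
normalised = ["train_norm.tsv", "test_norm.tsv"]
print(shlex.join(normalise_command(raw, normalised)))
print(shlex.join(concat_command(normalised[:1], "train_concat.tsv")))
```

Each returned list can be handed to subprocess.run(cmd, check=True) from the repository root to execute the step.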

Run models

For this project we used the following BERT models:

The optimal number of epochs for each model was as follows:

  • EN: 2
  • EN-multi: 2
  • DE: 1
  • DE-multi: 3
  • DE-HASOC: 2
  • DE-HASOC-multi: 8
  • GermEval: 6
  • GermEval-multi: 4

Set the data paths in the config files to point to the concatenated normalised file, and set the paths in test.json to point to the normalised test files.

Important note: if the data_path parameter is a single file, it will be split into train and validation sets. If you want the model to train on the whole data, you have to make it the first element of a list, as shown in the configs/English_train_whole_data.json file.
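
The difference between the two forms of data_path can be illustrated with two config fragments. This is a sketch under stated assumptions: only the data_path key is taken from the description above, the file path is a hypothetical placeholder, and real config files contain further keys that are not shown here. The helper function is illustrative and not part of the project.

```python
import json

# Hypothetical path, used for illustration only.
CONCATENATED = "data/en_concat_normalised.tsv"

# A single file: it will be split into train and validation.
split_config = {"data_path": CONCATENATED}

# A list: its first element is used in full for training.
whole_data_config = {"data_path": [CONCATENATED]}


def trains_on_whole_file(config):
    """Return True if data_path is a list, i.e. no train/validation split."""
    return isinstance(config["data_path"], list)


for name, config in [("split", split_config), ("whole", whole_data_config)]:
    print(name, json.dumps(config), trains_on_whole_file(config))
```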

python3 run_configs.py \
                     --mode train \
                     --configs ./configs \
                     --test_files test.json

After the training process has finished, the best systems make predictions on the given test files. These are put into the predicted dictionary. The training process is the same for the categorical subtask.

Run the following script to get the final result:

python3 run_configs.py \
                     --mode result \
                     --configs ./configs \
                     --test_files test.json

If the test file does not contain labels, add the --test argument to the above command.

Rule system

The rule systems are located under scripts/rule_system.

Results on the samples

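| Test | System | TP | TN | FP | FN | Precision (%) | Recall (%) | F1 (%) |
|------|--------|----|----|----|----|---------------|------------|--------|
| EN | EN-all | 64 | 23 | 11 | 2 | 85.3 | 97.0 | 90.8 |
| EN | DE-all-multi | 12 | 32 | 2 | 54 | 85.7 | 18.2 | 30.0 |
| EN | Rules | 32 | 32 | 2 | 34 | 94.1 | 48.5 | 64.0 |
| EN | EN-all $\cup$ Rules | 64 | 22 | 12 | 2 | 84.2 | 97.0 | 90.1 |
| EN | DE-all-multi $\cup$ Rules | 35 | 30 | 4 | 31 | 89.7 | 53.0 | 66.7 |
| EN | EN-all $\cup$ DE-all-multi | 64 | 22 | 12 | 2 | 84.2 | 97.0 | 90.1 |
| EN | EN-all $\cup$ DE-all-multi $\cup$ Rules | 64 | 21 | 13 | 2 | 83.1 | 97.0 | 89.5 |
| DE | DE-all | 12 | 62 | 5 | 21 | 70.6 | 36.4 | 48.0 |
| DE | EN-all-multi | 10 | 63 | 4 | 23 | 71.4 | 30.3 | 42.6 |
| DE | Rules | 4 | 66 | 1 | 29 | 80.0 | 12.1 | 21.1 |
| DE | DE-all $\cup$ Rules | 13 | 61 | 6 | 20 | 68.4 | 39.4 | 50.0 |
| DE | EN-all-multi $\cup$ Rules | 12 | 62 | 5 | 21 | 70.6 | 36.4 | 48.0 |
| DE | DE-all $\cup$ EN-all-multi | 15 | 58 | 9 | 18 | 62.5 | 45.5 | 52.6 |
| DE | DE-all $\cup$ EN-all-multi $\cup$ Rules | 16 | 57 | 10 | 17 | 61.5 | 48.5 | 54.2 |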
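
The precision, recall and F1 columns follow from the TP, FP and FN counts by the standard definitions, which the sketch below reproduces. It also shows one plausible reading of the $\cup$ rows, namely that a sample is flagged when at least one of the combined systems flags it; that reading is an assumption, as are the function names, which are not taken from the project code.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) as percentages rounded to one decimal.

    Assumes tp > 0, so no denominator is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * value, 1) for value in (precision, recall, f1))


def union(*system_predictions):
    """Combine per-sample boolean predictions: flag if any system flags.

    Assumption: this is what the table's union rows denote.
    """
    return [any(flags) for flags in zip(*system_predictions)]


# EN-all on the EN samples: TP=64, FP=11, FN=2
print(precision_recall_f1(tp=64, fp=11, fn=2))  # (85.3, 97.0, 90.8)

# Toy predictions for three samples from two hypothetical systems.
print(union([True, False, False], [False, False, True]))  # [True, False, True]
```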

Contributors

gkinga
