
KAREN: Unifying Hatespeech Detection and Benchmarking

This project started as a course project for the 2021 Natural Language Processing course at Tsinghua University and is still a work in progress. Our final project report is available in report.pdf. Contributions for further work are welcome.

Introduction

Hate speech, also known as offensive or abusive language, is defined as “any form of communication that disparages a person or group on the basis of some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, religion or other characteristic” (Nockleby, 2000). Nowadays, thanks to the availability of the internet and the emergence of social media, people have the tools to express their opinions online, which includes the widespread dissemination of hate speech. Such speech can cause severe psychological harm to individuals and can promote verbal or even physical violence against a group. Because of these consequences, both industry and academia have been working to develop techniques that accurately detect such forms of hate. These solutions, however, are not unified: most research proposes a solution together with its own dataset and evaluates only on that dataset. This approach suffers from several problems.

Firstly, bias. Due to cultural differences and differing points of view between individuals, the perception of hate speech varies and is highly subjective, which results in some datasets being biased one way or another.

Secondly, dataset incompatibility. It is common for recent models to make use of metadata, which can improve results by providing background information, but this often leads to low compatibility between models and datasets.

Overall, it is hard to identify the current state of the art and the most promising research directions, since very few models can be directly compared: they are trained on different datasets.

To combat these issues we propose KAREN, a framework that intends to unify this research area. Our contribution is an easy-to-use system that unifies the testing platform and can be used by beginners and researchers at the forefront of the field alike. It eases the design of data pre-processing and model implementation, allowing researchers to compare models on their own machines or to contribute their own datasets, making it easy to obtain results on new research, compare against other baselines, and test the robustness of different models in different environments.

Running

To run the framework, you just need to run the run.py file at the root of the repository. To get started, simply run:

python3 run.py --model softmaxregression --dataset hatexplain --dropout 0.15 --max-epochs 5

You can check the parameters of each model in its source file or in the configuration printed when it starts running.

Contributing

You can contribute to the framework by adding models and datasets that fit the framework's format. Please note that, for simplicity, we treat this task as multi-class classification, so the model must output out_feat scores, which are then passed to a softmax function.
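As a minimal sketch of this contract (shapes only; the variable names here are illustrative, not framework API), the model returns raw scores of shape (batch_size, out_feat), and the softmax is applied downstream:

import torch

batch_size, out_feat = 32, 3
logits = torch.randn(batch_size, out_feat)  # model output: one score per class
probs = torch.softmax(logits, dim=-1)       # applied downstream by the framework
assert probs.shape == (batch_size, out_feat)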

Models

All implemented models must extend the superclass BaseModel in framework/models/base_model.py and override its methods (which are used by the training and testing scripts). You can see an example of a softmax classifier in framework/models/softmax_regression.py.

If your model requires specific arguments, you can request them from the parser using the add_required_arguments(parser) method. At the moment, running multiple models that request the same argument will fail. You should also create a make_model function that picks up the arguments from the parser and extracts the ones your model needs.

After implementing your model, register it with the framework by adding the @RegisterModel decorator. This makes sure the framework can find your model.

You'll also need to add an import in framework/models/__init__.py.
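Putting these steps together, a new model might look like the following. This is only a sketch under stated assumptions: the class name, layer sizes, and forward signature are hypothetical, and the exact BaseModel interface should be checked against softmax_regression.py.

import torch.nn as nn

from framework.models.base_model import BaseModel


@RegisterModel  # assumed importable alongside the other model utilities
class MeanPoolClassifier(BaseModel):
    """Hypothetical example: mean-pooled embeddings + MLP classifier."""

    def __init__(self, in_feat, out_feat, vocab_size, device, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, in_feat)
        self.classifier = nn.Sequential(
            nn.Linear(in_feat, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_feat),  # raw scores; softmax applied downstream
        )

    def forward(self, tokens):
        pooled = self.embedding(tokens).mean(dim=1)  # average over the sequence
        return self.classifier(pooled)

    @staticmethod
    def add_required_arguments(parser):
        parser.add_argument('--hidden-dim', type=int, default=128)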

Note: different models make use of different data, and this framework intends to provide a unified way of testing them while easing implementation. Each model has a collection of requirements that must be contained within the dataset for it to run. When naming these requirements, make sure you are not duplicating names, introducing typos, or spelling the same feature in different ways. You can check the available features of a dataset through its data_requirements() method.
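For example (the dataset class name below is assumed from the HateXPlain example, not verified against the code):

from framework.datasets.hatexplain import HateXPlain  # class name assumed

dataset = HateXPlain()
print(dataset.data_requirements())  # lists the features this dataset provides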

Available arguments

When developing a model, some extra arguments are always available for selection. Currently, the list is the following:

  • in_feat
  • out_feat
  • vocab_size
  • device

The make_model function should not use any arguments other than those in this list and the ones it requested via its own add_required_arguments.
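A make_model for the sketch above might look like this (again hypothetical; it reads only the always-available arguments plus the one the model registered itself):

def make_model(args):
    return MeanPoolClassifier(
        in_feat=args.in_feat,
        out_feat=args.out_feat,
        vocab_size=args.vocab_size,
        device=args.device,
        hidden_dim=args.hidden_dim,  # registered via add_required_arguments
    )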

Datasets

Datasets are implemented similarly to models. You must extend BaseDataset from the file framework/datasets/base_dataset.py and implement the required logic. framework/datasets/hatexplain.py provides an example of how to implement a dataset with lazy preprocessing.

To register a dataset, use the @RegisterDataset decorator and add the import in framework/datasets/__init__.py. All the remaining logic is the same as for the models.
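A minimal dataset sketch, under the same caveats as the model example (names hypothetical; check the exact BaseDataset interface against hatexplain.py):

from framework.datasets.base_dataset import BaseDataset


@RegisterDataset  # assumed importable alongside the other dataset utilities
class MyCorpus(BaseDataset):
    """Hypothetical example dataset with lazy preprocessing."""

    def __init__(self):
        super().__init__()
        self._data = None  # loaded and preprocessed on first access

    def data_requirements(self):
        # features this dataset can provide to models
        return ['tokens', 'labels']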

Results

The results are available in results.md.

Contributors

clairecyq, spkgyk, tiagomantunes

Issues

Add missing references

A few models are missing references in their model headers. The reference should follow the same pattern as in AngryBERT.

Models missing reference:

  • CharCNN
  • NetLSTM

Fix CharCNN evaluation

This model is designed to run on characters but it's currently being evaluated on word tokens.

Solution:

  • Add character embeddings to the datasets and run CharCNN on them (see the sketch below)
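One possible direction, purely as a sketch (the helper below is hypothetical and not part of the framework): derive fixed-length character indices from the existing word tokens.

def to_char_ids(tokens, char_vocab, max_word_len=16, pad_id=0):
    # map each word to a fixed-length list of character ids
    char_ids = []
    for word in tokens:
        ids = [char_vocab.get(c, pad_id) for c in word[:max_word_len]]
        ids += [pad_id] * (max_word_len - len(ids))  # pad short words
        char_ids.append(ids)
    return char_ids

char_vocab = {c: i for i, c in enumerate('abcdefghijklmnopqrstuvwxyz', start=1)}
print(to_char_ids(['hate', 'speech'], char_vocab))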

Next steps

The framework now provides an easy-to-use interface that speeds up model and dataset evaluation. Some things can still be improved.

Overall, we need more models. If you have any model that you would like to see implemented, feel free to submit a pull request.

For the future, a few functionalities are desired:

  • Support more embeddings
  • Compatibility is currently easy to break. It would be preferable to have a configuration file listing the available classifications
  • Add support for custom training: for example, some models would prefer a sparse Adam optimizer, or even support secondary tasks (see the sketch after this list)
  • More things can be added to the toolkit. Any suggestions?
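As one possible shape for the custom-training item, here is a hedged sketch (the preferred_optimizer attribute is hypothetical, not existing framework API); note that torch.optim.SparseAdam only works with sparse gradients, e.g. from nn.Embedding(..., sparse=True):

import torch

def build_optimizer(model, lr):
    # fall back to Adam unless the model declares a preference
    if getattr(model, 'preferred_optimizer', None) == 'sparse_adam':
        return torch.optim.SparseAdam(model.parameters(), lr=lr)
    return torch.optim.Adam(model.parameters(), lr=lr)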

If you have any suggestions, comment them below.
Thank you!

Deterministic computation

Currently this is how we handle the seeds for the computation, following what is stated in pytorch/pytorch#7068:

import os
import random

import numpy as np
import torch

os.environ['PYTHONHASHSEED'] = str(seed)
random.seed(seed)     # Python random module
np.random.seed(seed)  # NumPy module
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

But when running a simple SoftmaxRegression model twice, we get different outputs:

tiagoantunes:hatespeech/ (master) $ python3 run.py --model softmaxregression --dataset HATexplAin --max-epochs 1 --batch-size 64                                                                                                                              [20:07:09]
******************************  CONFIGURATION  ******************************
batch_size                              64
cpu                                     False
dataset                                 ['hatexplain']
dropout                                 0.1
embedding_dim                           200
embeddings                              None
lr                                      0.001
max_epochs                              1
model                                   ['softmaxregression']
savename_hatexplain                     HateXPlain.dataset
seed                                    12345
url_hatexplain                          https://raw.githubusercontent.com/hate-alert/HateXplain/master/Data/dataset.json
***************************************************************************** 

Preprocessing HateXPlain

Starting training of (Model=softmaxregression Dataset=hatexplain)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 255/255 [00:00<00:00, 309.05it/s, loss=972]
Epoch #1 validation accuracy = 0.294789
Accuracy increased from 0 to 0.29478907585144043, saving model.

Test accuracy: 0.2730281352996826
tiagoantunes:hatespeech/ (master) $ python3 run.py --model softmaxregression --dataset HATexplAin --max-epochs 1 --batch-size 64                                                                                                                              [20:07:19]
******************************  CONFIGURATION  ******************************
batch_size                              64
cpu                                     False
dataset                                 ['hatexplain']
dropout                                 0.1
embedding_dim                           200
embeddings                              None
lr                                      0.001
max_epochs                              1
model                                   ['softmaxregression']
savename_hatexplain                     HateXPlain.dataset
seed                                    12345
url_hatexplain                          https://raw.githubusercontent.com/hate-alert/HateXplain/master/Data/dataset.json
***************************************************************************** 

Preprocessing HateXPlain

Starting training of (Model=softmaxregression Dataset=hatexplain)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 255/255 [00:00<00:00, 287.52it/s, loss=979]
Epoch #1 validation accuracy = 0.300744
Accuracy increased from 0 to 0.3007444143295288, saving model.

Test accuracy: 0.2961941659450531

I have tried using torch.use_deterministic_algorithms(True), but with no success either.

I haven't been able to find a fix. Solutions/Suggestions are appreciated.
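One avenue worth trying, offered only as a sketch (it assumes the nondeterminism comes from CUDA kernels or DataLoader shuffling, which is not confirmed):

import os
import torch

# some CUDA versions require this before strict deterministic mode works
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
torch.use_deterministic_algorithms(True)

# seed DataLoader shuffling and worker processes explicitly
g = torch.Generator()
g.manual_seed(seed)  # `seed` and `dataset` as defined in the framework
loader = torch.utils.data.DataLoader(
    dataset, batch_size=64, shuffle=True, generator=g,
    worker_init_fn=lambda worker_id: torch.manual_seed(seed + worker_id),
)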
