
BanglaBait: Semi-Supervised Adversarial Approach for Clickbait Detection on Bangla Clickbait Dataset

Abstract

A title is considered clickbait when it intentionally lures readers to click on a piece of content by exploiting their curiosity. Although several studies have focused on detecting clickbait titles in English articles, low-resource languages like Bangla have not been given adequate attention. To tackle clickbait titles in Bangla, we have constructed the first Bangla clickbait detection dataset, containing 15,406 labeled news articles and 65,405 unlabelled news articles extracted from clickbait-prone news sites. Each article has been labeled by three expert linguists and includes the article's title, body, and other metadata. By incorporating labeled and unlabelled data, we fine-tune a pre-trained Bangla transformer model in an adversarial fashion using Semi-Supervised Generative Adversarial Networks (SS-GANs). The proposed model acts as a good baseline for this dataset, outperforming traditional neural network models (LSTM, GRU, CNN) and linguistic feature-based models. We expect that this dataset and the detailed analysis and comparison of these clickbait detection models will provide a fundamental basis for future research into detecting clickbait titles in Bengali articles. The full paper can be accessed here. The dataset can be accessed here.

Guide

Usage

Requires the following:

  • Python 3.8+

It is recommended to use a virtual environment tool such as virtualenv or conda. Follow the steps below to set up the project:

  • Clone this repository.
git clone https://github.com/mdmotaharmahtab/BanglaBait-Semi-Supervised-Adversarial-Approach-for-Clickbait-Detection-on-Bangla-Clickbait-Dataset.git
  • Create a conda environment and activate it:
conda create -n env_name python=3.8
conda activate env_name
  • Install the necessary packages: pip install -r requirements.txt
  • Run bash download_data.sh to download the Bangla clickbait data. This also downloads the Bangla Word2Vec and FastText embeddings and the Bangla part-of-speech model from the bnlp toolkit.

BanglaBait Dataset

The datasets used in our paper are available here.

List of files

  • train.csv
  • dev.csv
  • test.csv
  • unlabelled_data.csv

File Format

Column              Description
domain              Link to the article
date                Publish time of the article
category            Category of the news
title               Headline of the news
content             Body of the news article
translated_title    English translation of the headline
translated_content  English translation of the article body
label               1 for clickbait, 0 for non-clickbait
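
For example, the labelled splits can be loaded and inspected with pandas (a minimal sketch; the dataset/data/clickbait path matches the default data_dir used in the configs below):

import pandas as pd

# assumes download_data.sh has placed the CSV splits under dataset/data/clickbait
data_dir = "dataset/data/clickbait"
train = pd.read_csv(f"{data_dir}/train.csv")
dev = pd.read_csv(f"{data_dir}/dev.csv")
test = pd.read_csv(f"{data_dir}/test.csv")

print(train.columns.tolist())         # domain, date, category, title, content, ...
print(train["label"].value_counts())  # 1 = clickbait, 0 = non-clickbait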

Training

Statistical Models

  • Go to root dir of repo
  • Run:
python train/run_statistical.py --config config/statistical_config.yaml
  • Models:
    • LR (Logistic Regression)
    • RF (Random Forest)

To choose between the Logistic Regression and Random Forest models and select which features to use, edit the statistical_config.yaml file. The config files use YAML format.

  • LR: Logistic Regression
    • C: inverse of regularization strength
    • penalty: norm of the penalty
    • solver: optimization algorithm
  • RF: Random Forest
    • criterion: function to measure the quality of a split
    • max_depth: maximum tree depth
    • max_features: number of features to consider when looking for the best split
    • n_estimators: number of trees
  • data_column: name of the dataframe column used as training data.
  • data_dir: folder path of the clickbait data. Default: dataset/data/clickbait.
  • feature: name of the feature to use for training.
  • label_column: name of the dataframe column that holds the labels.
  • model_type: which model to use (LR/RF).
  • save_model: whether to save the trained model.

Sample config file to run a logistic regression model using the unigram feature:

LR:
  C: 10
  penalty: l2
  solver: liblinear
RF:
  criterion: gini
  max_depth: 100
  max_features: sqrt
  n_estimators: 700
data_column: title # use this column in dataframe for training data
data_dir: dataset/data/clickbait
feature: unigram # which feature to use for training
label_column: label # which column in dataframe contains clickbait label
model_type: LR
save_model: true # if true, save trained model
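
For reference, a simplified, hypothetical sketch of how such a YAML config can drive the model choice is shown below; the actual train/run_statistical.py may be organized differently:

import yaml
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

with open("config/statistical_config.yaml") as f:
    cfg = yaml.safe_load(f)

# dispatch on model_type; the per-model config sections map directly to sklearn kwargs
if cfg["model_type"] == "LR":
    model = LogisticRegression(**cfg["LR"])      # C, penalty, solver
else:
    model = RandomForestClassifier(**cfg["RF"])  # criterion, max_depth, max_features, n_estimators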

All possible features for the statistical models are derived from the BanFake paper.

  • Possible choices of feature in the statistical_config.yaml file (a combined example is sketched after this list):
    • unigram
    • bigram
    • trigram
    • u_b_t: unigram, bigram, and trigram combined
    • char_3: character 3-gram
    • char_4: character 4-gram
    • char_5: character 5-gram
    • char_3_4_5: character 3-, 4-, and 5-grams combined
    • lexical: u_b_t and char_3_4_5 combined
    • pos: normalized part-of-speech frequency
    • L_POS: part-of-speech and lexical combined
    • word_100: Bangla Word2Vec feature
    • word_300: Bangla FastText feature
    • L_POS_Emb_F: lexical, word_300, and part-of-speech combined
    • L_POS_Emb_N: lexical, word_100, and part-of-speech combined
    • mp: normalized punctuation frequency
    • L_POS_Emb_F_MP: lexical, word_300, part-of-speech, and punctuation combined
    • L_POS_Emb_N_MP: lexical, word_100, part-of-speech, and punctuation combined
    • allfeatures: all features combined
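
As an illustration, the lexical feature (u_b_t and char_3_4_5 combined) could be built with scikit-learn roughly as below; this is a sketch, and the paper's exact vectorizer settings (e.g., raw counts versus TF-IDF) may differ:

import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

train = pd.read_csv("dataset/data/clickbait/train.csv")

lexical = FeatureUnion([
    ("u_b_t", TfidfVectorizer(analyzer="word", ngram_range=(1, 3))),       # uni/bi/trigrams
    ("char_3_4_5", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),  # char 3-5 grams
])
X_train = lexical.fit_transform(train["title"])  # sparse combined feature matrix
y_train = train["label"]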

NN Models

  • Go to root dir of repo
  • Run:
CUDA_VISIBLE_DEVICES=0 python train/run_nn.py --config config/nn_config.yaml
  • Models:
    • BiLSTM-Attention or only BiLSTM
    • CNN

To choose between the BiLSTM and CNN models and select which embedding to use, edit the nn_config.yaml file. A sample config file for a BiLSTM-attention model with the Bangla FastText embedding is given below -

model_type: lstm
use_class_weight: true # whether to use weighted cross entropy loss

lstm:
  use_attention: true # whether to use self attention layer on top of biLSTM stack
  out_dropout_rate : 0.5 #dropout rate between LSTM layers
  lstm_hidden_dim: 256 #lstm layer hidden size
  lstm_layers: 2 #number of biLSTM layers
  bidirectional: true #if true, use bidirectional lstm
  pad_idx : 0 #vocab id of pad token
  
cnn:
  in_channels : 1 #number of input channels, always 1
  out_channels : 256 # number of output channels
  kernel_heights : [1, 2, 3, 4] # kernel lengths
  stride : 1 # convolution stride
  padding : 0 # convolution padding
  keep_probab : 0.8 # Probability of retaining an activation node during dropout operation

learning_rate : !!float 2e-05
batch_size : 64

embedding: fasttext
embedding_path: embedding/bengali_fasttext_wiki.bin # relative path of bangla embedding
update_embedding: false # whether to update bangla pretrained word embeddings
load_embedding_path: embedding/embedding_weights_fasttext.pt # if word embedding tensor is saved in this path, load from this path
epochs : 20
device : cuda
output_path: output/lstm_test_1
data_dir: dataset/data/clickbait
data_column: title # use this column in dataframe for training data
label_column: label
label_list: [0,1] # 0=non clickbait, 1=clickbait
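
For intuition, an illustrative PyTorch sketch of the BiLSTM-attention classifier this config describes is given below; the repository's actual model code may differ in detail:

import torch
import torch.nn as nn

class BiLSTMAttn(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=256, layers=2,
                 dropout=0.5, pad_idx=0, num_classes=2, use_attention=True):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers, batch_first=True,
                            bidirectional=True, dropout=dropout)  # dropout between LSTM layers
        self.use_attention = use_attention
        if use_attention:
            self.attn = nn.Linear(2 * hidden, 1)  # scores each time step
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, ids):                        # ids: (batch, seq_len)
        out, _ = self.lstm(self.embedding(ids))    # (batch, seq_len, 2*hidden)
        if self.use_attention:
            weights = torch.softmax(self.attn(out), dim=1)
            out = (weights * out).sum(dim=1)       # attention-weighted sum over time
        else:
            out = out[:, -1]                       # last time step
        return self.fc(out)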

To use Bangla Word2Vec features, change the following:

embedding: word2vec
embedding_path: embedding/bangla_word2vec/bnwiki_word2vec.vector

To use the CNN model, change the following:

model_type: cnn

To run the BiLSTM model without attention, change the following:

lstm:
  use_attention: false

Other parameters can also be tweaked in the nn_config.yaml file. For example, lstm_layers controls the number of stacked LSTM layers, and bidirectional controls whether to use a bidirectional or unidirectional LSTM.
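
In the same spirit, a rough PyTorch sketch of the Kim-style text CNN the cnn config block corresponds to (kernel_heights are the convolution window sizes over tokens; again, the actual implementation may differ):

import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, out_channels=256,
                 kernel_heights=(1, 2, 3, 4), keep_probab=0.8, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv2d(1, out_channels, (k, emb_dim)) for k in kernel_heights
        )
        self.dropout = nn.Dropout(1 - keep_probab)  # keep_probab is a retention probability
        self.fc = nn.Linear(out_channels * len(kernel_heights), num_classes)

    def forward(self, ids):                           # ids: (batch, seq_len)
        x = self.embedding(ids).unsqueeze(1)          # (batch, 1, seq_len, emb_dim)
        # convolve, max-pool over time for each kernel height, then concatenate
        feats = [torch.relu(conv(x)).squeeze(3).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(feats, dim=1)))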

Transformer Models

  • Go to root dir of repo
  • Run:
CUDA_VISIBLE_DEVICES=0 python train/run_transformer.py --config config/transformer_config.yaml

To choose between different transformer models and tweak model parameters, edit the transformer_config.yaml file. A sample config file for a BanglaBERT model is given below -

batch_size: 64
data_dir: dataset/data/clickbait
device: cuda
epochs: 20
label_list:
- 0
- 1
learning_rate: 1.0e-05
model_name_or_path: csebuetnlp/banglabert # transformers model name from huggingface hub
model_type: transformer # transformer for supervised training
out_dropout_rate: 0.5 # dropout layer between transformer and final classification layer
output_path: output/transformer_banglaberttest_1
tokenizer_class: AutoTokenizer
warmup_proportion: 0.1 # warmup rate for learning rate scheduler

The model_name_or_path parameter can be changed to test other pre-trained transformer models.
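
Supervised fine-tuning follows the standard Hugging Face pattern; a minimal sketch using the generic Auto classes (the example titles are placeholders, and the repository's training loop adds warmup scheduling, dropout, etc.):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
model = AutoModelForSequenceClassification.from_pretrained(
    "csebuetnlp/banglabert", num_labels=2
)

batch = tokenizer(["শিরোনাম ১", "শিরোনাম ২"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])              # 1 = clickbait, 0 = non-clickbait
loss = model(**batch, labels=labels).loss  # cross-entropy over the two classes
loss.backward()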

Semi Supervised GAN Transformer Models

  • Go to root dir of repo
  • Run:
CUDA_VISIBLE_DEVICES=0 python train/run_transformer.py --config config/gan_config.yaml

To choose between different transformer models and tweak model parameters, edit the gan_config.yaml file. A sample config file for a GAN-BanglaBERT model is given below -

batch_size: 64
data_dir: dataset/data/clickbait
device: cuda
epochs: 20
epsilon: 1.0e-08
label_list: # the third label is the extra SS-GAN class for generated (fake) samples
- 0
- 1
- 2
learning_rate_discriminator: 1.0e-05
learning_rate_generator: 1.0e-05
model_name_or_path: csebuetnlp/banglabert
model_type: gan
num_hidden_layers_d: 2  #number of fully connected layers in discriminator network
num_hidden_layers_g: 2 #number of fully connected layers in generator network
out_dropout_rate: 0.5
output_path: output/gan_banglabert_test_1
tokenizer_class: AutoTokenizer
warmup_proportion: 0.1

The model_name_or_path parameter can be changed to test other pre-trained transformer models. Other parameters that change the model architecture, such as num_hidden_layers_d and num_hidden_layers_g, can also be tweaked.
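
Conceptually, the setup follows the SS-GAN (GAN-BERT-style) recipe: a generator maps noise to fake transformer-like features, and the discriminator classifies inputs into the real classes plus one extra "fake" class (hence the three labels above). A minimal sketch with illustrative names and sizes, not the repository's exact code:

import torch
import torch.nn as nn

hidden = 768          # BanglaBERT hidden size (assumed)
num_real_classes = 2  # clickbait / non-clickbait

def mlp(in_dim, out_dim, layers, dropout=0.5):
    blocks = []
    for _ in range(layers):
        blocks += [nn.Linear(in_dim, hidden), nn.LeakyReLU(), nn.Dropout(dropout)]
        in_dim = hidden
    blocks.append(nn.Linear(hidden, out_dim))
    return nn.Sequential(*blocks)

generator = mlp(100, hidden, layers=2)                       # noise -> fake [CLS]-like feature
discriminator = mlp(hidden, num_real_classes + 1, layers=2)  # extra class = "fake"

noise = torch.randn(8, 100)
fake_logits = discriminator(generator(noise))  # training pushes these toward the fake class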

