The bert_opioid_drug_data_tutorial from ashwinambal

Tutorial to run BERT for Text Classification

There are three folders in the Tutorial:
Code- Contains the text classification code
Data- Contains the Train, Dev and Test Data

Input data needs to be TAB delimited and have three columns namely: ID, Text, Label

To run the tutorial with the given data, simply run bash Run_BERT_Opioid_tutorial.sh OR sbatch Run_BERT_Opioid_tutorial.sh (in the Agave Cluster or any SLURM cluster)

Run_BERT_Opioid_tutorial.sh- Script to run

Refer agaveintro.pdf for all arguments to the slurm job
Set SBATCH --mail-user=[email protected] command to the email ID of the user to get email notifications of any update on the job

All accepted arguments to the run_classifier_tutorial_opioid.py code are given below:

 usage: run_classifier_tutorial_opioid.py [-h] --data_dir DATA_DIR --bert_model
                                      BERT_MODEL --task_name TASK_NAME
                                      --output_dir OUTPUT_DIR
                                      [--cache_dir CACHE_DIR]
                                      [--max_seq_length MAX_SEQ_LENGTH]
                                      [--do_train] [--do_eval]
                                      [--do_lower_case]
                                      [--train_batch_size TRAIN_BATCH_SIZE]
                                      [--eval_batch_size EVAL_BATCH_SIZE]
                                      [--learning_rate LEARNING_RATE]
                                      [--num_train_epochs NUM_TRAIN_EPOCHS]
                                      [--warmup_proportion WARMUP_PROPORTION]
                                      [--no_cuda] [--local_rank LOCAL_RANK]
                                      [--seed SEED]
                                      [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
                                      [--fp16] [--loss_scale LOSS_SCALE]
                                      [--server_ip SERVER_IP]
                                      [--server_port SERVER_PORT]

The arguments in [] are optional and have default values set

BERT_Output- Contains the predicted labels, evaluation results, etc. which are the output of the classifier (Folder Required to run evaluation on the test set)

pytorch_model.bin- The trained model (Required for test)
bert_config.json- The saved configuration with which BERT was trained (Required for test)
eval_results.txt- The evaluation results on the test set
Reqd_Labels.csv- The predicted and gold standard labels for each sentence

Note: To run on a new dataset with name task_name, do the following:

Copy the InputProcessor class and make a new class with name task_name_InputProcessor
Change the get_labels function in the task_name_InputProcessor class to return a list of all possible labels for the new dataset
Add task_name to the processors dictionary under main and map it to task_name_InputProcessor
Add task_name to the num_labels_task dictionary under main and map it to the number of labels in the task. Note that this number should match with the number of labels returned by get_labels in Step 2.
Finally, make sure to call the python code with the appropriate task_name as an argument.
The appropriate Train-Test-Dev data should be present in the Data Directory specified in the args to the python file. This data should be delimited by TAB and have three columns namely: ID, Text, Label in that order.

Due to data privacy issues, the files in Data have been corrupted but kept with the same format (to show how data is formatted). The Data is not yet public and hence not made available here. Nevertheless, the code should work for any dataset if the above steps are followed. This Code uses three labels for the data namely ['Q', 'A', 'O'] which stands for Questions, Answers and Others.

For further reading, refer:
BERT Paper- [https://arxiv.org/abs/1810.04805]
Estimator- [https://github.com/google-research/bert#fine-tuning-with-bert]
Pytorch- [https://github.com/huggingface/pytorch-pretrained-BERT]

ashwinambal / bert_opioid_drug_data_tutorial Goto Github PK

bert_opioid_drug_data_tutorial's Introduction

Tutorial to run BERT for Text Classification

bert_opioid_drug_data_tutorial's People

Contributors

Stargazers

Watchers

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent