
🍜VPhoBertTagger

Token classification using PhoBERT models for 🇻🇳Vietnamese

🏞️Environments🏞️

Get started in seconds with verified environments. Run the script below to install all dependencies:

bash ./install_dependencies.sh

📚Dataset📚

The input data format of 🍜VPhoBertTagger follows the VLSP-2016 format: four columns separated by a tab character, holding the word, POS tag, chunk tag, and named-entity tag. Each segmented word is placed on a separate line, and an empty line follows each sentence. For details, see the sample data in the 'datasets/samples' directory. The table below shows an example Vietnamese sentence from the dataset.

Word      POS  Chunk  NER
Dương     Np   B-NP   B-PER
là        V    B-VP   O
một       M    B-NP   O
chủ       N    B-NP   O
cửa hàng  N    B-NP   O
lâu       A    B-AP   O
năm       N    B-NP   O
ở         E    B-PP   O
Hà Nội    Np   B-NP   B-LOC
.         CH   O      O

The dataset must be placed in a directory with the structure below.

├── data_dir
│   ├── train.txt
│   ├── dev.txt
│   └── test.txt
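
For a quick sanity check of your own files, a reader for this format could look like the sketch below; read_vlsp2016 is an illustrative helper, not part of this repository, and it assumes exactly the four tab-separated columns described above.

def read_vlsp2016(path):
    """Read a VLSP-2016-style file: one token per line with tab-separated
    word/POS/chunk/NER columns and a blank line between sentences."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():             # blank line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, pos, chunk, ner = line.split("\t")
            current.append((word, pos, chunk, ner))
    if current:                              # file may not end with a blank line
        sentences.append(current)
    return sentences

# e.g. train_sentences = read_vlsp2016("data_dir/train.txt")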

🎓Training🎓

The commands below fine-tune PhoBERT for the token-classification task. The pre-trained models are downloaded automatically from Hugging Face.

python main.py train --task vlsp2016 --run_test --data_dir ./datasets/vlsp2016 --model_name_or_path vinai/phobert-base --model_arch softmax --output_dir outputs --max_seq_length 256 --train_batch_size 32 --eval_batch_size 32 --learning_rate 3e-5 --epochs 20 --early_stop 2 --overwrite_data

or

bash ./train.sh

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • task (str, *optional): Training task, selected from [vlsp2016, vlsp2018_l1, vlsp2018_l2, vlsp2018_join]. Default=vlsp2016.
  • data_dir (Union[str, os.PathLike], *required): The input data directory. Should contain the data files for the task (train.txt, dev.txt, test.txt).
  • overwrite_data (bool, *optional): Whether to overwrite the split dataset. Default=False.
  • load_weights (Union[str, os.PathLike], *optional): Path to a pre-trained weights file.
  • model_name_or_path (str, *required): Pre-trained model, selected from [vinai/phobert-base, vinai/phobert-large, ...]. Default=vinai/phobert-base.
  • model_arch (str, *required): Model architecture, selected from [softmax, crf, lstm_crf].
  • output_dir (Union[str, os.PathLike], *required): The output directory where model predictions and checkpoints are written.
  • max_seq_length (int, *optional): The maximum total input sequence length after subword tokenization. Longer sequences are truncated and shorter ones padded (see the sketch after this list). Default=190.
  • train_batch_size (int, *optional): Total batch size for training. Default=32.
  • eval_batch_size (int, *optional): Total batch size for evaluation. Default=32.
  • learning_rate (float, *optional): The initial learning rate for Adam. Default=1e-4.
  • classifier_learning_rate (float, *optional): The initial learning rate for the classifier head. Default=5e-4.
  • epochs (float, *optional): Total number of training epochs to perform. Default=100.0.
  • weight_decay (float, *optional): Weight decay, if applied. Default=0.01.
  • adam_epsilon (float, *optional): Epsilon for the Adam optimizer. Default=5e-8.
  • max_grad_norm (float, *optional): Maximum gradient norm. Default=1.0.
  • early_stop (float, *optional): Number of evaluations without improvement before early stopping. Default=10.0.
  • no_cuda (bool, *optional): Whether to disable CUDA even when it is available. Default=False.
  • run_test (bool, *optional): Whether to evaluate the best model on the test set after training. Default=False.
  • seed (int, *optional): Random seed for initialization. Default=42.
  • num_workers (int, *optional): Number of subprocesses to use for data loading. 0 means the data is loaded in the main process. Default=0.
  • save_step (int, *optional): Save a model checkpoint every this many steps. Default=10000.
  • gradient_accumulation_steps (int, *optional): Number of update steps to accumulate before performing a backward/update pass. Default=1.
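
To see how max_seq_length behaves, the sketch below runs PhoBERT's tokenizer on one sentence. It assumes the Hugging Face transformers library is installed and is for illustration only; note that PhoBERT expects word-segmented input, with the syllables of a multi-syllable word joined by underscores (e.g. Hà_Nội), a conversion the repository's preprocessing may perform for you.

from transformers import AutoTokenizer

# Every example is truncated or padded to exactly max_seq_length subword ids.
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

words = ["Dương", "là", "một", "chủ", "cửa_hàng", "lâu", "năm", "ở", "Hà_Nội", "."]
enc = tokenizer(words, is_split_into_words=True,
                truncation=True, padding="max_length", max_length=256)

print(len(enc["input_ids"]))  # always 256, whatever the sentence length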

📈Tensorboard📈

The command below starts TensorBoard so you can follow the fine-tuning process.

tensorboard --logdir runs --host 0.0.0.0 --port=6006

🥇Performances🥇

All experiments were performed on an RTX 3090 with 24GB VRAM and a Xeon® E5-2678 v3 CPU with 64GB RAM, both of which are available for rent on vast.ai. The pre-trained models used for comparison are available on Hugging Face.

VLSP 2016

Model                             Arch      BIO Acc.  BIO P   BIO R   BIO F1  Acc. (w/o 'O')  NE Acc.  NE P    NE R    NE F1
Bert-base-multilingual-cased [1]  Softmax   0.9905    0.9239  0.8776  0.8984  0.9068          0.9905   0.8938  0.8941  0.8939
                                  CRF       0.9903    0.9241  0.8880  0.9048  0.9087          0.9903   0.8951  0.8945  0.8948
                                  LSTM_CRF  0.9905    0.9183  0.8898  0.9027  0.9178          0.9905   0.8879  0.8992  0.8935
PhoBert-base [2]                  Softmax   0.9950    0.9312  0.9404  0.9348  0.9570          0.9950   0.9434  0.9466  0.9450
                                  CRF       0.9949    0.9497  0.9248  0.9359  0.9525          0.9949   0.9516  0.9456  0.9486
                                  LSTM_CRF  0.9949    0.9535  0.9181  0.9349  0.9456          0.9949   0.9520  0.9396  0.9457
viBERT [3]                        Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...
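
In these tables, BIO Acc./P/R/F1 and Acc. (w/o 'O') are token-level scores over the BIO tags (the latter excluding the dominant O tag), while the NE columns are entity-level: a predicted entity counts only if both its span and its type match the gold annotation. The repository's exact evaluation code may differ, but entity-level scores of this kind can be computed with the seqeval library, as in this sketch:

from seqeval.metrics import classification_report, f1_score

# Hypothetical gold and predicted BIO tag sequences for two sentences.
y_true = [["B-PER", "O", "O", "O", "B-LOC"],
          ["B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "O", "O", "B-LOC", "B-LOC"],
          ["B-ORG", "I-ORG", "O"]]

# An entity is counted as correct only if span and type both match.
print(f1_score(y_true, y_pred))        # micro-averaged entity-level F1
print(classification_report(y_true, y_pred))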

VLSP 2018

Level 1

Model                             Arch      BIO Acc.  BIO P   BIO R   BIO F1  Acc. (w/o 'O')  NE Acc.  NE P    NE R    NE F1
Bert-base-multilingual-cased [1]  Softmax   0.9828    0.7421  0.7980  0.7671  0.8510          0.9828   0.7302  0.8339  0.7786
                                  CRF       0.9824    0.7716  0.7619  0.7601  0.8284          0.9824   0.7542  0.8127  0.7824
                                  LSTM_CRF  0.9829    0.7533  0.7750  0.7626  0.8296          0.9829   0.7612  0.8122  0.7859
PhoBert-base [2]                  Softmax   0.9896    0.7970  0.8404  0.8170  0.8892          0.9896   0.8421  0.8942  0.8674
                                  CRF       0.9903    0.8124  0.8428  0.8260  0.8834          0.9903   0.8695  0.8943  0.8817
                                  LSTM_CRF  0.9901    0.8240  0.8278  0.8241  0.8715          0.9901   0.8671  0.8773  0.8721
viBERT [3]                        Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...

Level 2

Model                             Arch      BIO Acc.  BIO P   BIO R   BIO F1  Acc. (w/o 'O')  NE Acc.  NE P    NE R    NE F1
Bert-base-multilingual-cased [1]  Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...
PhoBert-base [2]                  Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...
viBERT [3]                        Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...

Join

Model                             Arch      BIO Acc.  BIO P   BIO R   BIO F1  Acc. (w/o 'O')  NE Acc.  NE P    NE R    NE F1
Bert-base-multilingual-cased [1]  Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...
PhoBert-base [2]                  Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...
viBERT [3]                        Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...

References

[1] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT (pp. 4171-4186).

[2] Nguyen, D. Q., & Nguyen, A. T. (2020, November). PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 1037-1042).

[3] The, V. B., Thi, O. T., & Le-Hong, P. (2020). Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models. arXiv preprint arXiv:2006.15994.

🧠Inference🧠

The command below loads your fine-tuned model and runs inference on your text input.

python main.py predict --model_path outputs/best_model.pt

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • model_path (Union[str, os.PathLike], *optional): Path to the fine-tuned model file.
  • no_cuda (bool, *optional): Whether to disable CUDA even when it is available. Default=False.

🌟Demo🌟

The command below loads your fine-tuned model and starts the demo page.

python main.py demo --model_path outputs/best_model.pt

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • model_path (Union[str, os.PathLike], *optional): Path to the fine-tuned model file.
  • no_cuda (bool, *optional): Whether to disable CUDA even when it is available. Default=False.

💡Acknowledgements💡

Pre-trained PhoBERT model by VinAI Research and PyTorch implementation by Hugging Face.
