
🍜VPhoBertTagger

Token classification using PhoBERT models for 🇻🇳Vietnamese

🏞️Environments🏞️

Get started in seconds with verified environments. Run the script below to install all dependencies:

bash ./install_dependencies.sh

📚Dataset📚

The input data format of 🍜VPhoBertTagger follows the VLSP-2016 format: four columns separated by a tab character, holding the word, POS tag, chunk tag, and named-entity tag. Each segmented word is placed on a separate line, and an empty line follows each sentence. For details, see the sample data in the 'datasets/samples' directory. The table below shows an example Vietnamese sentence from the dataset.

Word      POS  Chunk  NER
Dương     Np   B-NP   B-PER
là        V    B-VP   O
một       M    B-NP   O
chủ       N    B-NP   O
cửa hàng  N    B-NP   O
lâu       A    B-AP   O
năm       N    B-NP   O
ở         E    B-PP   O
Hà Nội    Np   B-NP   B-LOC
.         CH   O      O

The dataset must be placed in a directory with the structure below.

├── data_dir
│   ├── train.txt
│   ├── dev.txt
│   └── test.txt
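
For a quick sanity check of your own files, a reader for this format could look like the sketch below; read_vlsp2016 is an illustrative helper, not part of this repository, and it assumes exactly the four tab-separated columns described above.

def read_vlsp2016(path):
    """Read a VLSP-2016-style file: one token per line with tab-separated
    word/POS/chunk/NER columns and a blank line between sentences."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():             # blank line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, pos, chunk, ner = line.split("\t")
            current.append((word, pos, chunk, ner))
    if current:                              # file may not end with a blank line
        sentences.append(current)
    return sentences

# e.g. train_sentences = read_vlsp2016("data_dir/train.txt")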

🎓Training🎓

The commands below fine-tune PhoBERT for the token-classification task. The pre-trained models are downloaded automatically from Hugging Face.

python main.py train --task vlsp2016 --run_test --data_dir ./datasets/vlsp2016 --model_name_or_path vinai/phobert-base --model_arch softmax --output_dir outputs --max_seq_length 256 --train_batch_size 32 --eval_batch_size 32 --learning_rate 3e-5 --epochs 20 --early_stop 2 --overwrite_data

or

bash ./train.sh

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • task (str, *optional): Training task, selected from [vlsp2016, vlsp2018_l1, vlsp2018_l2, vlsp2018_join]. Default=vlsp2016.
  • data_dir (Union[str, os.PathLike], *required): The input data directory. Should contain the data files for the task (train.txt, dev.txt, test.txt).
  • overwrite_data (bool, *optional): Whether to overwrite the split dataset. Default=False.
  • load_weights (Union[str, os.PathLike], *optional): Path to a pre-trained weights file.
  • model_name_or_path (str, *required): Pre-trained model, selected from [vinai/phobert-base, vinai/phobert-large, ...]. Default=vinai/phobert-base.
  • model_arch (str, *required): Model architecture, selected from [softmax, crf, lstm_crf].
  • output_dir (Union[str, os.PathLike], *required): The output directory where model predictions and checkpoints are written.
  • max_seq_length (int, *optional): The maximum total input sequence length after subword tokenization. Longer sequences are truncated and shorter ones padded (see the sketch after this list). Default=190.
  • train_batch_size (int, *optional): Total batch size for training. Default=32.
  • eval_batch_size (int, *optional): Total batch size for evaluation. Default=32.
  • learning_rate (float, *optional): The initial learning rate for Adam. Default=1e-4.
  • classifier_learning_rate (float, *optional): The initial learning rate for the classifier head. Default=5e-4.
  • epochs (float, *optional): Total number of training epochs to perform. Default=100.0.
  • weight_decay (float, *optional): Weight decay, if applied. Default=0.01.
  • adam_epsilon (float, *optional): Epsilon for the Adam optimizer. Default=5e-8.
  • max_grad_norm (float, *optional): Maximum gradient norm. Default=1.0.
  • early_stop (float, *optional): Number of evaluations without improvement before early stopping. Default=10.0.
  • no_cuda (bool, *optional): Whether to disable CUDA even when it is available. Default=False.
  • run_test (bool, *optional): Whether to evaluate the best model on the test set after training. Default=False.
  • seed (int, *optional): Random seed for initialization. Default=42.
  • num_workers (int, *optional): Number of subprocesses to use for data loading. 0 means the data is loaded in the main process. Default=0.
  • save_step (int, *optional): Save a model checkpoint every this many steps. Default=10000.
  • gradient_accumulation_steps (int, *optional): Number of update steps to accumulate before performing a backward/update pass. Default=1.
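
To see how max_seq_length behaves, the sketch below runs PhoBERT's tokenizer on one sentence. It assumes the Hugging Face transformers library is installed and is for illustration only; note that PhoBERT expects word-segmented input, with the syllables of a multi-syllable word joined by underscores (e.g. Hà_Nội), a conversion the repository's preprocessing may perform for you.

from transformers import AutoTokenizer

# Every example is truncated or padded to exactly max_seq_length subword ids.
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

words = ["Dương", "là", "một", "chủ", "cửa_hàng", "lâu", "năm", "ở", "Hà_Nội", "."]
enc = tokenizer(words, is_split_into_words=True,
                truncation=True, padding="max_length", max_length=256)

print(len(enc["input_ids"]))  # always 256, whatever the sentence length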

📈Tensorboard📈

The command below starts TensorBoard so you can follow the fine-tuning process.

tensorboard --logdir runs --host 0.0.0.0 --port=6006

🥇Performances🥇

All experiments were performed on an RTX 3090 with 24GB VRAM and a Xeon® E5-2678 v3 CPU with 64GB RAM, both of which are available for rent on vast.ai. The pre-trained models used for comparison are available on Hugging Face.

VLSP 2016

Model                             Arch      BIO Acc.  BIO P   BIO R   BIO F1  Acc. (w/o 'O')  NE Acc.  NE P    NE R    NE F1
Bert-base-multilingual-cased [1]  Softmax   0.9905    0.9239  0.8776  0.8984  0.9068          0.9905   0.8938  0.8941  0.8939
                                  CRF       0.9903    0.9241  0.8880  0.9048  0.9087          0.9903   0.8951  0.8945  0.8948
                                  LSTM_CRF  0.9905    0.9183  0.8898  0.9027  0.9178          0.9905   0.8879  0.8992  0.8935
PhoBert-base [2]                  Softmax   0.9950    0.9312  0.9404  0.9348  0.9570          0.9950   0.9434  0.9466  0.9450
                                  CRF       0.9949    0.9497  0.9248  0.9359  0.9525          0.9949   0.9516  0.9456  0.9486
                                  LSTM_CRF  0.9949    0.9535  0.9181  0.9349  0.9456          0.9949   0.9520  0.9396  0.9457
viBERT [3]                        Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...
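
In these tables, BIO Acc./P/R/F1 and Acc. (w/o 'O') are token-level scores over the BIO tags (the latter excluding the dominant O tag), while the NE columns are entity-level: a predicted entity counts only if both its span and its type match the gold annotation. The repository's exact evaluation code may differ, but entity-level scores of this kind can be computed with the seqeval library, as in this sketch:

from seqeval.metrics import classification_report, f1_score

# Hypothetical gold and predicted BIO tag sequences for two sentences.
y_true = [["B-PER", "O", "O", "O", "B-LOC"],
          ["B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "O", "O", "B-LOC", "B-LOC"],
          ["B-ORG", "I-ORG", "O"]]

# An entity is counted as correct only if span and type both match.
print(f1_score(y_true, y_pred))        # micro-averaged entity-level F1
print(classification_report(y_true, y_pred))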

VLSP 2018

Level 1

Model                             Arch      BIO Acc.  BIO P   BIO R   BIO F1  Acc. (w/o 'O')  NE Acc.  NE P    NE R    NE F1
Bert-base-multilingual-cased [1]  Softmax   0.9828    0.7421  0.7980  0.7671  0.8510          0.9828   0.7302  0.8339  0.7786
                                  CRF       0.9824    0.7716  0.7619  0.7601  0.8284          0.9824   0.7542  0.8127  0.7824
                                  LSTM_CRF  0.9829    0.7533  0.7750  0.7626  0.8296          0.9829   0.7612  0.8122  0.7859
PhoBert-base [2]                  Softmax   0.9896    0.7970  0.8404  0.8170  0.8892          0.9896   0.8421  0.8942  0.8674
                                  CRF       0.9903    0.8124  0.8428  0.8260  0.8834          0.9903   0.8695  0.8943  0.8817
                                  LSTM_CRF  0.9901    0.8240  0.8278  0.8241  0.8715          0.9901   0.8671  0.8773  0.8721
viBERT [3]                        Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...

Level 2

Model                             Arch      BIO Acc.  BIO P   BIO R   BIO F1  Acc. (w/o 'O')  NE Acc.  NE P    NE R    NE F1
Bert-base-multilingual-cased [1]  Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...
PhoBert-base [2]                  Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...
viBERT [3]                        Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...

Join

Model                             Arch      BIO Acc.  BIO P   BIO R   BIO F1  Acc. (w/o 'O')  NE Acc.  NE P    NE R    NE F1
Bert-base-multilingual-cased [1]  Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...
PhoBert-base [2]                  Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...
viBERT [3]                        Softmax   ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  CRF       ...       ...     ...     ...     ...             ...      ...     ...     ...
                                  LSTM_CRF  ...       ...     ...     ...     ...             ...      ...     ...     ...

References

[1] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT (pp. 4171-4186).

[2] Nguyen, D. Q., & Nguyen, A. T. (2020, November). PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 1037-1042).

[3] The, V. B., Thi, O. T., & Le-Hong, P. (2020). Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models. arXiv preprint arXiv:2006.15994.

🧠Inference🧠

The command below loads your fine-tuned model and runs inference on your text input.

python main.py predict --model_path outputs/best_model.pt

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • model_path (Union[str, os.PathLike], *optional): Path to the fine-tuned model file.
  • no_cuda (bool, *optional): Whether to disable CUDA even when it is available. Default=False.

🌟Demo🌟

The command below loads your fine-tuned model and starts the demo page.

python main.py demo --model_path outputs/best_model.pt

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • model_path (Union[str, os.PathLike], *optional): Path to the fine-tuned model file.
  • no_cuda (bool, *optional): Whether to disable CUDA even when it is available. Default=False.

💡Acknowledgements💡

Pre-trained PhoBERT model by VinAI Research and PyTorch implementation by Hugging Face.
