BERT-th

Google's BERT is currently the state-of-the-art method for pre-training text representations, and it additionally provides multilingual models. Unfortunately, Thai is the only one of the 103 languages that is excluded, owing to difficulties in word segmentation.

BERT-th presents the Thai-only pre-trained model based on the BERT-Base structure. It is now available to download.

BERT-th also includes relevant code and scripts along with the pre-trained model, all of which are modified versions of those in the original BERT project.

Preprocessing

Data Source

Training data for BERT-th come from the latest article dump of Thai Wikipedia, retrieved on November 2, 2018. The raw texts are extracted using WikiExtractor.

Sentence Segmentation

Input data need to be segmented into separate sentences before further processing by the BERT modules. Since the Thai language has no explicit sentence-ending marker, pinpointing sentence boundaries is problematic. To the best of our knowledge, no implementation of Thai sentence segmentation is available elsewhere, so in this project sentence segmentation is done with simple heuristics that consider spaces, sentence length, and common conjunctions.
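
To illustrate the idea, here is a minimal sketch of such a heuristic segmenter; the rules, length threshold, and conjunction list below are illustrative assumptions, not the exact ones used in this project.

# A hypothetical heuristic segmenter: split on spaces, then merge short
# fragments, and fragments starting with a common conjunction, back into
# the previous sentence. Threshold and conjunction list are examples only.
COMMON_CONJUNCTIONS = ('และ', 'หรือ', 'แต่', 'เพราะ', 'ซึ่ง', 'โดย')
MIN_SENTENCE_LEN = 20  # characters; an assumed threshold

def segment_sentences(text):
    fragments = [f for f in text.split(' ') if f]
    sentences = []
    for frag in fragments:
        merge = (sentences and
                 (len(frag) < MIN_SENTENCE_LEN or
                  frag.startswith(COMMON_CONJUNCTIONS)))
        if merge:
            sentences[-1] += ' ' + frag
        else:
            sentences.append(frag)
    return sentences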

After preprocessing, the training corpus consists of approximately 2 million sentences and 40 million words (counting words after word segmentation by PyThaiNLP). The plain and segmented texts can be downloaded here.
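
For reference, the word counting relies on PyThaiNLP's word segmentation, which can be used along these lines (a minimal sketch; the default segmentation engine may differ between PyThaiNLP versions):

from pythainlp.tokenize import word_tokenize

# Count words in one sentence; word_tokenize performs dictionary-based
# Thai word segmentation.
words = word_tokenize('ตัวอย่างประโยคภาษาไทย')
print(len(words), words)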

Tokenization

BERT uses WordPiece as its tokenization mechanism, but the WordPiece learner is Google-internal, so we cannot apply an existing Thai word segmenter and then use WordPiece to learn a set of subword units. The best alternative is SentencePiece, which implements BPE and needs no prior word segmentation.

In this project, we adopt a pre-trained Thai SentencePiece model from BPEmb. The model with a vocabulary size of 25,000 is chosen, and its vocabulary file has to be augmented with BERT's special tokens, including '[PAD]', '[CLS]', '[SEP]' and '[MASK]'. The model and vocabulary files can be downloaded here.
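
Augmenting the vocabulary amounts to appending the special tokens as extra entries. A minimal sketch is shown below; the tab-separated "token<TAB>score" layout follows the SentencePiece .vocab format, and the placeholder score of 0 is an assumption.

# A sketch: append BERT's special tokens to the SentencePiece vocab file.
# SentencePiece .vocab files hold tab-separated "token<TAB>score" lines;
# the score of 0 for the added tokens is an assumption.
SPECIAL_TOKENS = ['[PAD]', '[CLS]', '[SEP]', '[MASK]']

with open('th.wiki.bpe.op25000.vocab', 'a', encoding='utf-8') as f:
    for token in SPECIAL_TOKENS:
        f.write('%s\t0\n' % token)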

SentencePiece and bpe_helper.py from BPEmb are both used to tokenize data, and a ThaiTokenizer class has been added to BERT's tokenization.py for tokenizing Thai texts.
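
For illustration, the underlying SentencePiece model can be applied to raw, unsegmented Thai text as follows (a minimal sketch using the sentencepiece package; ThaiTokenizer additionally maps the resulting pieces through bpe_helper.py and the augmented vocabulary):

import sentencepiece as spm

# Load the pre-trained Thai SentencePiece model from BPEmb.
sp = spm.SentencePieceProcessor()
sp.Load('th.wiki.bpe.op25000.model')

# Encode raw (unsegmented) Thai text into subword pieces and ids.
pieces = sp.EncodeAsPieces('ประโยคตัวอย่าง')
ids = sp.EncodeAsIds('ประโยคตัวอย่าง')
print(pieces)
print(ids)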

Pre-training

The data can be prepared before pre-training by using this script.

export BPE_DIR=/path/to/bpe
export TEXT_DIR=/path/to/text
export DATA_DIR=/path/to/data

python create_pretraining_data.py \
  --input_file=$TEXT_DIR/thaiwikitext_sentseg \
  --output_file=$DATA_DIR/tf_examples.tfrecord \
  --vocab_file=$BPE_DIR/th.wiki.bpe.op25000.vocab \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5 \
  --thai_text=True \
  --spm_file=$BPE_DIR/th.wiki.bpe.op25000.model
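
For intuition, masked_lm_prob and max_predictions_per_seq control masked-LM example creation roughly as in the simplified sketch below; it follows the standard BERT masking scheme (80% [MASK], 10% random token, 10% unchanged) rather than reproducing create_pretraining_data.py exactly.

import random

def mask_tokens(tokens, vocab, masked_lm_prob=0.15, max_predictions_per_seq=20):
    # Number of positions to mask: masked_lm_prob of the sequence,
    # capped at max_predictions_per_seq.
    num_to_mask = min(max_predictions_per_seq,
                      max(1, int(round(len(tokens) * masked_lm_prob))))
    candidates = [i for i, t in enumerate(tokens)
                  if t not in ('[CLS]', '[SEP]')]
    random.shuffle(candidates)
    output = list(tokens)
    for i in candidates[:num_to_mask]:
        r = random.random()
        if r < 0.8:                    # 80%: replace with [MASK]
            output[i] = '[MASK]'
        elif r < 0.9:                  # 10%: replace with a random token
            output[i] = random.choice(vocab)
        # else: 10%: keep the original token
    return output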

Then, the following script can be run to learn a model from scratch.

export DATA_DIR=/path/to/data
export BERT_BASE_DIR=/path/to/bert_base

python run_pretraining.py \
  --input_file=$DATA_DIR/tf_examples.tfrecord \
  --output_dir=$BERT_BASE_DIR \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=1000000 \
  --num_warmup_steps=100000 \
  --learning_rate=1e-4 \
  --save_checkpoints_steps=200000

We trained the model for 1 million steps, which took around 20 days to complete on a Tesla K80 GPU. However, we provide a snapshot at 0.8 million steps, since it yields better results on downstream classification tasks.

Downstream Classification Tasks

XNLI

XNLI is a dataset for evaluating cross-lingual inferential classification. Its development and test sets cover 15 languages, and their data have been thoroughly edited. Machine-translated versions of the training data are also provided.

The Thai-only pre-trained BERT model can be applied to the XNLI task by using the training data translated to Thai. Spaces between words in the training data need to be removed to make them consistent with the inputs in the pre-training step. The processed XNLI files for Thai can be downloaded here.
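
The space removal itself is a one-line transformation, sketched below for a single sentence:

# Strip spaces from Thai sentences so they match the unsegmented
# pre-training inputs.
def remove_spaces(sentence):
    return sentence.replace(' ', '')

print(remove_spaces('ประโยค ตัวอย่าง ภาษา ไทย'))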

Afterwards, the XNLI task can be learned by using this script.

export BPE_DIR=/path/to/bpe
export XNLI_DIR=/path/to/xnli
export OUTPUT_DIR=/path/to/output
export BERT_BASE_DIR=/path/to/bert_base

python run_classifier.py \
  --task_name=XNLI \
  --do_train=true \
  --do_eval=true \
  --data_dir=$XNLI_DIR \
  --vocab_file=$BPE_DIR/th.wiki.bpe.op25000.vocab \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=5e-5 \
  --num_train_epochs=2.0 \
  --output_dir=$OUTPUT_DIR \
  --xnli_language=th \
  --spm_file=$BPE_DIR/th.wiki.bpe.op25000.model

The table below compares the Thai-only model with the XNLI baselines and with the Multilingual Cased model, which is also fine-tuned using translated data.

          XNLI Baseline                           BERT
Translate Train   Translate Test   Multilingual Model   Thai-only Model
     62.8             64.4               66.1                68.9

Wongnai Review Dataset

The Wongnai Review Dataset collects restaurant reviews and ratings from the Wongnai website. The task is to classify a review into one of five ratings (1 to 5 stars). The dataset can be downloaded here, and the following script runs the Thai-only model on this task.

export BPE_DIR=/path/to/bpe
export WONGNAI_DIR=/path/to/wongnai
export OUTPUT_DIR=/path/to/output
export BERT_BASE_DIR=/path/to/bert_base

python run_classifier.py \
  --task_name=wongnai \
  --do_train=true \
  --do_predict=true \
  --data_dir=$WONGNAI_DIR \
  --vocab_file=$BPE_DIR/th.wiki.bpe.op25000.vocab \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=5e-5 \
  --num_train_epochs=2.0 \
  --output_dir=$OUTPUT_DIR \
  --spm_file=$BPE_DIR/th.wiki.bpe.op25000.model

Without additional preprocessing or further fine-tuning, the Thai-only BERT model achieves public and private test-set scores of 0.56612 and 0.57057, respectively.
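
With --do_predict, run_classifier.py writes per-class probabilities to test_results.tsv in the output directory. A minimal sketch for turning them into 1-5 star predictions follows; the exact submission format expected by the task is an assumption.

import csv

# Each row of test_results.tsv holds five probabilities, one per rating.
# Pick the argmax and shift to the 1-5 star scale.
with open('test_results.tsv', encoding='utf-8') as f:
    for row in csv.reader(f, delimiter='\t'):
        probs = [float(p) for p in row]
        rating = probs.index(max(probs)) + 1
        print(rating)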
