
kortok

An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks

An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks [pdf]
Kyubyong Park*, Joohong Lee*, Seongbo Jang*, Dawoon Jung*
Accepted to AACL-IJCNLP 2020. (*indicates equal contribution)

Abstract: Typically, tokenization is the very first step in most text processing works. As a token serves as an atomic unit that embeds the contextual information of text, how to define a token plays a decisive role in the performance of a model.
Even though Byte Pair Encoding (BPE) has been considered the de facto standard tokenization method due to its simplicity and universality, it still remains unclear whether BPE works best across all languages and tasks. In this paper, we test several tokenization strategies in order to answer our primary research question, that is, "What is the best tokenization strategy for Korean NLP tasks?"
Experimental results demonstrate that a hybrid approach of morphological segmentation followed by BPE works best in Korean to/from English machine translation and natural language understanding tasks such as KorNLI, KorSTS, NSMC, and PAWS-X. As an exception, for KorQuAD, the Korean extension of SQuAD, BPE segmentation turns out to be the most effective.

Installation

pip install -r requirements.txt

Tokenization Strategies

There are six tokenization strategies for Korean. See here for how to prepare and use each of them.

  1. Consonant and Vowel
  2. Syllable
  3. Morpheme
  4. Subword
  5. Morpheme-aware Subword
  6. Word

The vocabularies and BPE models were built from the Korean Wikipedia corpus, which was extracted and refined with attardi/wikiextractor.
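
As a concrete illustration of the morpheme-aware subword strategy (strategy 5 above), the pipeline below extracts plain text from a Korean Wikipedia dump, segments it into morphemes with MeCab-ko, and then trains and applies a SentencePiece BPE model on the morpheme-segmented text. This is only a minimal sketch using the wikiextractor, mecab, and sentencepiece command-line tools; the file names, the 8K vocabulary size, and the exact options are illustrative and are not the repository's own scripts.

# Sketch only: extract plain text from a Korean Wikipedia dump (dump file name is a placeholder).
python -m wikiextractor.WikiExtractor kowiki-latest-pages-articles.xml.bz2 -o extracted_wiki
cat extracted_wiki/*/wiki_* > corpus.txt   # further cleaning of <doc> tags omitted

# Morpheme segmentation with MeCab-ko (wakati output = space-separated morphemes).
mecab -O wakati corpus.txt > corpus.morpheme.txt

# Train a BPE model on the morpheme-segmented corpus, then tokenize with it.
spm_train --input=corpus.morpheme.txt --model_prefix=mecab_sp-8k \
          --vocab_size=8000 --model_type=bpe
spm_encode --model=mecab_sp-8k.model --output_format=piece \
           < corpus.morpheme.txt > corpus.tokenized.txt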

Korean from/to English Translation

Scores are BLEU on the dev and test sets (higher is better); OOV Rate is the out-of-vocabulary rate in percent, and Avg. Length is the average tokenized sequence length (tokens per sentence).

| Tokenization | Vocab Size | ko-en (Dev) | ko-en (Test) | en-ko (Dev) | en-ko (Test) | OOV Rate | Avg. Length |
|---|---|---|---|---|---|---|---|
| CV | 166 | 39.11 | 38.56 | 36.52 | 36.45 | 0.02 | 142.75 |
| Syllable | 2K | 39.30 | 38.75 | 38.64 | 38.45 | 0.06 | 69.20 |
| Morpheme | 8K | 31.59 | 31.24 | 32.44 | 32.19 | 7.51 | 49.19 |
| | 16K | 34.38 | 33.80 | 35.74 | 35.52 | 4.67 | 49.19 |
| | 32K | 36.19 | 35.74 | 36.51 | 36.12 | 2.72 | 49.19 |
| | 64K | 37.88 | 37.37 | 37.51 | 37.03 | 1.40 | 49.19 |
| Subword | 4K | 39.18 | 38.75 | 38.31 | 38.18 | 0.07 | 48.02 |
| | 8K | 39.16 | 38.75 | 38.09 | 37.94 | 0.08 | 38.44 |
| | 16K | 39.22 | 38.77 | 37.64 | 37.34 | 0.10 | 33.69 |
| | 32K | 39.05 | 38.69 | 37.11 | 36.98 | 0.11 | 30.21 |
| | 64K | 37.02 | 36.46 | 35.77 | 35.64 | 0.12 | 27.50 |
| Morpheme-aware Subword | 4K | 39.41 | 38.95 | 39.29 | 39.13 | 0.06 | 65.17 |
| | 8K | 39.42 | 39.06 | 39.78 | 39.61 | 0.06 | 56.79 |
| | 16K | 39.84 | 39.41 | 40.23 | 40.04 | 0.07 | 53.30 |
| | 32K | 41.00 | 40.34 | 40.43 | 40.41 | 0.07 | 51.38 |
| | 64K | 39.62 | 39.34 | 38.63 | 38.42 | 0.07 | 50.27 |
| Word | 64K | 7.04 | 7.07 | 18.68 | 18.42 | 26.20 | 18.96 |

Dataset

Recently, a Korean-English parallel corpus was publicly released by AI Hub; it was gathered from various sources such as news, government websites, and legal documents. We downloaded the news data, which amounts to 800K sentence pairs, and randomly split it into 784K (train), 8K (dev), and 8K (test) pairs.
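
A minimal sketch of the random 784K/8K/8K split described above, assuming the news pairs have been collected into line-aligned files news.ko and news.en (file names are illustrative; this is not the repository's own preprocessing script):

# Shuffle the sentence pairs jointly, then carve off 8K test, 8K dev, and the remaining 784K train.
paste news.ko news.en | shuf > shuffled.tsv
head -n 8000 shuffled.tsv | cut -f1 > test.ko
head -n 8000 shuffled.tsv | cut -f2 > test.en
sed -n '8001,16000p' shuffled.tsv | cut -f1 > dev.ko
sed -n '8001,16000p' shuffled.tsv | cut -f2 > dev.en
tail -n +16001 shuffled.tsv | cut -f1 > train.ko
tail -n +16001 shuffled.tsv | cut -f2 > train.en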

Training & Evaluation

We ran all the experiments using pytorch/fairseq (Ott et al., 2019), a PyTorch-based deep learning library for sequence-to-sequence models.

1. Preprocess

fairseq-preprocess \
--source-lang ko \
--target-lang en \
--trainpref ./dataset/translation/mecab_sp-8k/train \
--validpref ./dataset/translation/mecab_sp-8k/dev \
--testpref ./dataset/translation/mecab_sp-8k/test \
--destdir ./dataset/translation/mecab_sp-8k/preprocessed/ko-en \
--srcdict ./resources/mecab_sp-8k/fairseq.vocab \
--tgtdict ./resources/en_sp-32k/fairseq.vocab

2. Training

We used the Transformer (Vaswani et al., 2017), the state-of-the-art model for neural machine translation, and mostly followed the base model configuration: 6 encoder and 6 decoder blocks with model dimension 512, feed-forward dimension 2048, and 8 attention heads.

fairseq-train ./dataset/translation/mecab_sp-8k/preprocessed/ko-en \
--arch transformer \
--share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-epoch 50 \
--batch-size 128 \
--save-dir translation_ckpt/mecab_sp-8k/ko-en \
--disable-validation
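
For reference, the base configuration selected by --arch transformer above corresponds roughly to the following explicit flags. This variant is shown only to make the architecture hyper-parameters visible; it is not an additional step in the recipe.

# Same architecture spelled out explicitly (6 encoder/decoder blocks, d_model=512,
# feed-forward dim 2048, 8 attention heads); remaining options as in the command above.
fairseq-train ./dataset/translation/mecab_sp-8k/preprocessed/ko-en \
--arch transformer \
--encoder-layers 6 --decoder-layers 6 \
--encoder-embed-dim 512 --decoder-embed-dim 512 \
--encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
--encoder-attention-heads 8 --decoder-attention-heads 8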

3. Evaluation

We report BLEU scores on both the dev and test sets using the Moses multi-bleu.perl script. Following WAT 2019 (Nakazawa et al., 2019), the Moses tokenizer (for English) and MeCab-ko (for Korean) are used to tokenize the evaluation data.

fairseq-generate ./dataset/translation/mecab_sp-8k/preprocessed/ko-en \
--path translation_ckpt/mecab_sp-8k/ko-en/checkpoint_best.pt \
--batch-size 512 \
--beam 5 \
--remove-bpe sentencepiece
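
Assuming the fairseq-generate output above is redirected to a file (gen.out is an assumed name), the hypotheses and references can be pulled out of the log and scored with multi-bleu.perl along these lines; the script path is a placeholder for wherever Moses is checked out:

# H-* lines hold hypotheses (id, score, text) and T-* lines hold references (id, text);
# grep preserves their shared order, so the two files stay line-aligned.
grep "^H-" gen.out | cut -f3 > hyp.txt
grep "^T-" gen.out | cut -f2 > ref.txt
# For en-ko, re-tokenize both files with MeCab-ko first (e.g. mecab -O wakati), as noted above.
perl mosesdecoder/scripts/generic/multi-bleu.perl ref.txt < hyp.txt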

Korean Natural Language Understanding

| Tokenization | Vocab Size | KorQuAD (EM / F1) | KorNLI | KorSTS | NSMC | PAWS-X |
|---|---|---|---|---|---|---|
| CV | 166 | 59.66 / 73.91 | 70.60 | 71.20 | 77.22 | 71.47 |
| Syllable | 2K | 69.10 / 83.29 | 73.98 | 73.47 | 82.70 | 75.86 |
| Morpheme | 32K | 68.05 / 83.82 | 74.86 | 74.37 | 82.37 | 76.83 |
| | 64K | 70.68 / 85.25 | 75.06 | 75.69 | 83.21 | 77.38 |
| Subword | 4K | 71.48 / 83.11 | 74.38 | 74.03 | 83.37 | 76.80 |
| | 8K | 72.91 / 85.11 | 74.18 | 74.65 | 83.23 | 76.42 |
| | 16K | 73.42 / 85.75 | 74.46 | 75.15 | 83.30 | 76.41 |
| | 32K | 74.04 / 86.30 | 74.74 | 74.29 | 83.02 | 77.01 |
| | 64K | 74.04 / 86.66 | 73.73 | 74.55 | 83.52 | 77.47 |
| Morpheme-aware Subword | 4K | 67.53 / 81.93 | 73.53 | 73.45 | 83.34 | 76.03 |
| | 8K | 70.90 / 84.57 | 74.14 | 73.95 | 83.71 | 76.07 |
| | 16K | 69.47 / 83.36 | 75.02 | 74.99 | 83.22 | 76.59 |
| | 32K | 72.65 / 86.35 | 74.10 | 75.13 | 83.65 | 78.11 |
| | 64K | 69.48 / 83.73 | 76.39 | 76.61 | 84.29 | 76.78 |
| Word | 64K | 1.54 / 8.86 | 64.06 | 65.83 | 69.00 | 60.41 |

Pre-training

For each tokenization strategy, we pre-trained a BERT-Base model (Devlin et al., 2019) on a Cloud TPU v3-8 for 1M steps using the official google-research/bert code.

We set the training hyper-parameters of all models as follows: batch_size=1024, max_sequence_length=128, learning_rate=5e-5, warm_up_steps=10000.
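
A sketch of the corresponding pre-training launch with the official google-research/bert code is shown below. The GCS bucket, TPU name, and file paths are placeholders, and the input .tfrecord files are assumed to have been produced beforehand with create_pretraining_data.py using the vocabulary of the given tokenization strategy.

python run_pretraining.py \
--input_file=gs://<BUCKET>/pretraining_data/*.tfrecord \
--output_dir=gs://<BUCKET>/bert_base_<TOKENIZER_NAME> \
--bert_config_file=bert_config.json \
--do_train=True \
--train_batch_size=1024 \
--max_seq_length=128 \
--learning_rate=5e-5 \
--num_train_steps=1000000 \
--num_warmup_steps=10000 \
--use_tpu=True \
--tpu_name=<TPU_NAME>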

Because the Korean Wikipedia corpus (640 MB) is not large enough for pre-training, we additionally downloaded a recent Namuwiki dump (5.5 GB) and extracted plain text from it using Namu Wiki Extractor.

Fine-tuning

After converting each pre-trained TensorFlow checkpoint to PyTorch, we fine-tuned the models using huggingface/transformers (Wolf et al., 2019).
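
One way to do the TensorFlow-to-PyTorch conversion is the transformers CLI, as sketched below; the checkpoint paths are placeholders, and the repository may ship its own conversion script instead.

transformers-cli convert --model_type bert \
--tf_checkpoint bert_base_<TOKENIZER_NAME>/bert_model.ckpt \
--config bert_base_<TOKENIZER_NAME>/bert_config.json \
--pytorch_dump_output bert_base_<TOKENIZER_NAME>/pytorch_model.bin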

Fine-tuning example:

python tasks/<TASK_NAME>/run_train.py --tokenizer <TOKENIZER_NAME>

Citation

@article{park2020empirical,
  title={An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks},
  author={Park, Kyubyong and Lee, Joohong and Jang, Seongbo and Jung, Dawoon},
  journal={arXiv preprint arXiv:2010.02534},
  year={2020}
}

Acknowledgements

For pre-training the models, we used Cloud TPUs provided by the TensorFlow Research Cloud program.
