Code Monkey home page Code Monkey logo

chinesegec's Introduction

Chinese Grammatical Correction Using BERT-based Pre-trained Model

Code for BERT-encoder introduced in the AACL-IJCNLP2020 paper Chinese Grammatical Correction Using BERT-based Pre-trained Model. If you find this work helpful in your research, please cite as:

@inproceedings{wang-etal-2020-chinese-grammatical-correction,
    title = "{C}hinese Grammatical Correction Using {BERT}-based Pre-trained Model",
    author = "Wang, Hongfei  and
      Kurosawa, Michiki  and
      Katsumata, Satoru  and
      Komachi, Mamoru",
    booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
    year = "2020",
    url = "https://www.aclweb.org/anthology/2020.aacl-main.20"
}

Requirements

  • python >= 3.5
  • PyTorch == 1.1.0
  • pytorch_transformers >= 1.2.0
  • other depencies required by fairseq

How to use

Download the data and the pre-trained model

The official NLPCC2018_GEC training and test data can be downloaded from here. In the official training data, one error sentences may have multiple corrections, we divide them into seperate parts. Since there are no official development data, we randomly extract 5,000 sentences from the training data as the development data. We also segment all sentences into characters. Here is our preprocessed data.

For pre-trained model, see here. The pre-trained model we used in our work is RoBERTa-wwm-ext, Chinese.

Use fairseq command to turn the data into binary datasets

python preprocess.py  --user-dir ./user  \
--task bert_translation \
--srcdict $PATH_TO_PRE_TRAINED_MODEL/vocab.txt  \
-s src -t trg \
--bert-name $PATH_TO_PRE_TRAINED_MODEL \
--destdir $DATA_BIN_DIR  \
--trainpref $PATH_TO_DATA/train \
--validpref $PATH_TO_DATA/dev \
--testpref $PATH_TO_DATA/test

Train the model

python train.py $DATA_BIN_DIR \
--seed 1 --user-dir ./user/ \
--optimizer adam --bert-name  $PATH_TO_PRETRAINED_MODEL \
--task bert_translation --batch-size 32 \
--arch bert_nmt --max-epoch 30 \
--save-dir  $SAVED_MODEL \
-s src -t trg --lr 0.00003

Generate

python generate.py $DATA_BIN_DIR \
--task bert_translation  -s src -t trg  --user-dir ./user \
--path $SAVED_MODEL/checkpoint_best.pt --batch-size 32 \
--beam 12  > $OUTPUT_PATH/bert_encoder.txt

Evaluate

Download the PKUNLP word segmentation tools, and m2scorer. Use fairseq_reformat.py to get the raw text output.

python fairseq_reformat.py $OUTPUT_PATH/bert_encoder.txt

Delete the space and [UNK] tokens.

sed 's/<unk>//g' < $OUTPUT_PATH/bert_encoder.txt.reformat > bert_encoder.nounk.txt
sed 's/ //g' < bert_encoder.nounk.txt > bert_encoder.nospac.txt

Segment the sentence into words using PKUNLP word segmentation tools.

cd libgrass-ui
python pkunlp_segment.py --corpus $PATH/bert_encoder.nospac.txt

Evaluate using m2scorer. The gold edit file is in official datasets.

python $PATH/m2scorer.py $PATH/bert_encoder.nospac.txt.seg $PATH/gold/gold.01

chinesegec's People

Contributors

alexeib avatar cndn avatar davidecaroselli avatar edunov avatar freewym avatar halilakin avatar hitvoice avatar hmc-cs-mdrissi avatar huihuifan avatar jhcross avatar jingfeidu avatar jma127 avatar kartikayk avatar lematt1991 avatar liezl200 avatar louismartin avatar michaelauli avatar myleott avatar nng555 avatar pipibjc avatar rutyrinott avatar skritika avatar stephenroller avatar taineleau avatar taylanbil avatar theweiho avatar vineetk1 avatar wang136906578 avatar xianxl avatar zhiqwang avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.