CokeBERT: Contextual Knowledge Selection and Embedding towards Enhanced Pre-Trained Language Models
- Accepted by EMNLP 2020 (Findings).
- Accepted by AI Open 2021.
CokeBert-1.0:
- CokeBert
- CokeRoberta

CokeBert-2.0-latest:
- CokeBert
- CokeRoberta (will release soon)
- pytorch
- transformers
- tqdm
- boto3
- requests
- You need to download the knowledge embeddings (including entity- and relation-to-id mappings) and the knowledge-graph neighbor information from here. Put them in the `data/pretrain` folder and unzip them (a loading sketch follows the commands below):
```bash
cd data/pretrain
# Download the files
tar zxvf kg_embed.tar.gz
tar zxvf kg_neighbor.tar.gz
cd ../..
```
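For reference, here is a minimal sketch of how the unpacked files could be loaded. The file names and layout (`entity2id.txt`, `relation2id.txt`, `entity2vec.vec`, ERNIE-style) are assumptions; check the contents of `kg_embed` after unpacking.

```python
# Sketch of loading the unpacked KG embedding files. File names and
# layout (a count header line, then tab-separated name/id pairs; one
# whitespace-separated vector per line) are assumptions, not documented.
import numpy as np

def load_id_map(path):
    # e.g. data/pretrain/kg_embed/entity2id.txt -> {"Q123": 0, ...}
    id_map = {}
    with open(path) as f:
        next(f)  # assumed header line holding the total count
        for line in f:
            name, idx = line.strip().split("\t")
            id_map[name] = int(idx)
    return id_map

ent2id = load_id_map("data/pretrain/kg_embed/entity2id.txt")
rel2id = load_id_map("data/pretrain/kg_embed/relation2id.txt")
ent_vecs = np.loadtxt("data/pretrain/kg_embed/entity2vec.vec", dtype=np.float32)
print(len(ent2id), "entities,", len(rel2id), "relations, embedding matrix", ent_vecs.shape)
```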
- Then, you can obtain pre-trained checkpoints from here and use CokeBert directly:
```python
from coke import CokeBertForPreTraining

model = CokeBertForPreTraining.from_pretrained('checkpoint/coke-bert-base-uncased', neighbor_hop=2)
```
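Note that `neighbor_hop` should match the number of knowledge-graph neighbor hops the checkpoint was trained with; the released checkpoints and the `HOP=2` setting used throughout the pre-training steps below both correspond to 2-hop neighborhoods.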
If you want to pre-train CokeBert on a different corpus and knowledge graph, follow the instructions below.
- Go to the folder of the latest version and choose a backbone model, e.g. `bert-base-uncased`:
```bash
cd CokeBert-2.0-latest
```
- We will provide a dataset for pre-training. If you want to use the latest data, please follow the ERNIE pipeline to pre-process your data, using the corresponding tokenizer of the backbone model. The outputs are `merge.bin` and `merge.idx`. After pre-processing, move them to the corresponding directory (a reader sketch follows the commands below):
```bash
export BACKBONE=bert-base-uncased
export HOP=2
mkdir data/pretrain/$BACKBONE
mv merge.bin data/pretrain/$BACKBONE
mv merge.idx data/pretrain/$BACKBONE
```
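If you want to sanity-check the pre-processed corpus, the sketch below assumes `merge.idx` is an array of byte offsets into `merge.bin`; this layout is a guess about the ERNIE pre-processing output, not documented behavior.

```python
# Hypothetical reader for an offset-indexed binary corpus: merge.idx is
# assumed to hold int64 byte offsets into merge.bin (unverified layout).
import numpy as np

offsets = np.fromfile("data/pretrain/bert-base-uncased/merge.idx", dtype=np.int64)
with open("data/pretrain/bert-base-uncased/merge.bin", "rb") as f:
    f.seek(int(offsets[0]))
    record = f.read(int(offsets[1] - offsets[0]))  # raw bytes of the first sample
print(f"{len(offsets)} index entries; first record is {len(record)} bytes")
```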
- Download the backbone model checkpoint from Hugging Face and move it to the corresponding checkpoint folder for pre-training. Note: do not download the backbone model's `config.json`, since we will be using the `coke` config. (An equivalent Python download is sketched after the commands.)
```bash
wget https://huggingface.co/$BACKBONE/resolve/main/vocab.txt -O checkpoint/coke-$BACKBONE/vocab.txt
wget https://huggingface.co/$BACKBONE/resolve/main/pytorch_model.bin -O checkpoint/coke-$BACKBONE/pytorch_model.bin
```
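If you prefer doing this from Python, the same files can be fetched with `huggingface_hub` (a sketch mirroring the `wget` calls above):

```python
# Download vocab.txt and pytorch_model.bin for $BACKBONE via huggingface_hub,
# mirroring the wget commands above (config.json is deliberately skipped).
import os, shutil
from huggingface_hub import hf_hub_download

backbone = os.environ.get("BACKBONE", "bert-base-uncased")
dst = f"checkpoint/coke-{backbone}"
os.makedirs(dst, exist_ok=True)
for fname in ("vocab.txt", "pytorch_model.bin"):
    cached = hf_hub_download(repo_id=backbone, filename=fname)
    shutil.copy(cached, os.path.join(dst, fname))
```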
- Download the knowledge embeddings (including entity- and relation-to-id mappings) and the knowledge-graph neighbor information from here. Put them in the `data/pretrain` folder and unzip them:
```bash
cd data/pretrain
# Download the files
tar zxvf kg_embed.tar.gz
tar zxvf kg_neighbor.tar.gz
cd ../..
```
- (Optional) Generate knowledge-graph neighbors. We have provided this data; if you want to change the maximum number of neighbors, you can run the following to regenerate the `kg_neighbor` data (a toy illustration of the truncation follows the commands):
```bash
cd data/pretrain
python3 preprocess_n.py
```
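Conceptually, the max-neighbor setting caps how many (relation, entity) pairs are kept per entity. A toy illustration of that truncation (the dict layout and names here are hypothetical, not the actual `preprocess_n.py` internals):

```python
# Illustrative only: cap each entity's neighbor list at max_n entries.
# The {entity_id: [(relation_id, neighbor_id), ...]} layout is hypothetical.
def truncate_neighbors(kg_neighbor, max_n=20):
    return {ent: nbrs[:max_n] for ent, nbrs in kg_neighbor.items()}

toy = {0: [(5, 1), (7, 2), (5, 3)]}
print(truncate_neighbors(toy, max_n=2))  # {0: [(5, 1), (7, 2)]}
```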
- Run pre-training:
```bash
cd examples
bash run_pretrain.sh
```
This writes logs and checkpoints to `./outputs`. Check `src/coke/training_args.py` for more arguments.
- As most datasets except FewRel do not have entity annotations, we use the annotated datasets from ERNIE. Download them from data, then unzip and save them to the corresponding directory:
```bash
unzip data.zip -d data/finetune
```
- After pre-training the Coke model, move `pytorch_model.bin` to the corresponding checkpoint directory (for CokeBert-1.0, the corresponding directories are `DKPLM/data/DKPLM_BERTbase_2layer` and `DKPLM/data/DKPLM_RoBERTabase_2layer`) and run fine-tuning:
```bash
export BACKBONE=bert-base-uncased
export HOP=2
mv outputs/pretrain_coke-$BACKBONE-$HOP/pytorch_model.bin ../checkpoint/coke-$BACKBONE/pytorch_model.bin
bash run_finetune.sh
```
Please cite our paper if you use CokeBert in your work:
```bibtex
@article{SU2021,
  title   = {CokeBERT: Contextual Knowledge Selection and Embedding towards Enhanced Pre-Trained Language Models},
  author  = {Yusheng Su and Xu Han and Zhengyan Zhang and Yankai Lin and Peng Li and Zhiyuan Liu and Jie Zhou and Maosong Sun},
  journal = {AI Open},
  year    = {2021},
  issn    = {2666-6510},
  doi     = {10.1016/j.aiopen.2021.06.004},
  url     = {https://arxiv.org/abs/2009.13964},
}
```