
CokeBert

CokeBERT: Contextual Knowledge Selection and Embedding towards Enhanced Pre-Trained Language Models

  • EMNLP-Findings 2020 Accepted.

  • AI-Open 2021 Accepted.

Overview

Figure: An example of capturing knowledge context from a KG and incorporating it for language understanding. Different circle sizes indicate different entity importance for understanding the given sentence.
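The selection idea in the figure can be illustrated with a minimal sketch. Note this is only an illustration, assuming a toy relevance score (word overlap with the sentence) in place of the model's learned attention; the entity and relation names below are made up:

```python
# Illustrative sketch of contextual knowledge selection (NOT the real model):
# score each KG neighbor of a mentioned entity against the sentence context
# and keep only the top-k most relevant neighbors.

def select_neighbors(context_words, neighbors, k=2):
    """Rank (entity, relation) neighbors by naive word overlap with the
    context words and keep the top-k."""
    def score(neighbor):
        entity, relation = neighbor
        words = set(entity.lower().split()) | set(relation.lower().split())
        return len(words & {w.lower() for w in context_words})
    return sorted(neighbors, key=score, reverse=True)[:k]

# A sentence about a novel: literary neighbors should outrank unrelated ones.
context = "Mr Darcy proposes to Elizabeth in Pride and Prejudice".split()
neighbors = [
    ("Pride and Prejudice", "character in"),
    ("Elizabeth Bennet", "spouse"),
    ("England", "country"),
]
selected = select_neighbors(context, neighbors, k=2)
print(selected)
```

In the actual model, this ranking is computed by an attention mechanism over neighbor entity embeddings conditioned on the sentence representation, rather than by word overlap.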

Code Version

v1.0

  • CokeBert
  • CokeRoberta

v2.0

  • CokeBert
  • CokeRoberta (will be released soon)

Requirements:

  • pytorch
  • transformers
  • tqdm
  • boto3
  • requests

How to Use

  • You need to download the knowledge embeddings (including entity- and relation-to-id mappings) and the knowledge graph neighbor information from here. Put them in the data/pretrain folder and unzip them.
cd data/pretrain

# Download the files

tar zxvf kg_embed.tar.gz
tar zxvf kg_neighbor.tar.gz

cd ../..
  • Then, you can obtain pre-trained checkpoints from here and use CokeBert directly.
from coke import CokeBertForPreTraining

model = CokeBertForPreTraining.from_pretrained('checkpoint/coke-bert-base-uncased', neighbor_hop=2)

If you want to pre-train CokeBert with a different corpus or knowledge graph, follow the instructions below.

Pre-training

Prepare Pre-training Data

  • Go to the folder for the latest version and choose a backbone model, e.g. bert-base-uncased.
cd CokeBert-2.0-latest
  • We will provide a dataset for pre-training. If you want to use the latest data, please follow the ERNIE pipeline to pre-process your data, using the tokenizer of the chosen backbone model. The outputs are merge.bin and merge.idx. After pre-processing, move them to the corresponding directory.
export BACKBONE=bert-base-uncased
export HOP=2

mkdir data/pretrain/$BACKBONE

mv merge.bin data/pretrain/$BACKBONE
mv merge.idx data/pretrain/$BACKBONE
  • Download the backbone model checkpoint from Huggingface and move it to the corresponding checkpoint folder for pre-training. Note: do not download config.json for the backbone model, since we will use Coke's config.
wget https://huggingface.co/$BACKBONE/resolve/main/vocab.txt -O checkpoint/coke-$BACKBONE/vocab.txt
wget https://huggingface.co/$BACKBONE/resolve/main/pytorch_model.bin -O checkpoint/coke-$BACKBONE/pytorch_model.bin
  • Download the knowledge embeddings (including entity- and relation-to-id mappings) and the knowledge graph neighbor information from here. Put them in the data/pretrain folder and unzip them.
cd data/pretrain

# Download the files

tar zxvf kg_embed.tar.gz
tar zxvf kg_neighbor.tar.gz

cd ../..
  • (Optional) Generate knowledge graph neighbors. We have provided this data; if you want to change the maximum number of neighbors, run this code to regenerate the kg_neighbor data:
cd data/pretrain
python3 preprocess_n.py
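As a rough sketch of what such a preprocessing step does (the actual arguments and file formats of preprocess_n.py are not shown in this README, so the function and entity names below are hypothetical):

```python
# Illustrative sketch: cap each entity's KG neighbor list at a maximum size,
# as a max-neighbors preprocessing step would do before pre-training.

def truncate_neighbors(kg_neighbors, max_neighbors):
    """Return a copy of the entity -> neighbor-list map with each
    (entity, relation) list capped at max_neighbors entries."""
    return {ent: nbrs[:max_neighbors] for ent, nbrs in kg_neighbors.items()}

kg = {
    "Q1": [("Q2", "P31"), ("Q3", "P279"), ("Q4", "P361")],
    "Q5": [("Q6", "P31")],
}
capped = truncate_neighbors(kg, max_neighbors=2)
print(capped)  # Q1 keeps its first 2 neighbors; Q5 keeps its single one
```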

Execute Pre-training

cd examples
bash run_pretrain.sh

It will write logs and checkpoints to ./outputs. Check src/coke/training_args.py for more arguments.

Fine-tuning

Fine-tuning Data

  • As most datasets except FewRel do not have entity annotations, we use the annotated datasets from ERNIE. Download them from data, then unzip and save them (data) to the corresponding directory.
unzip data.zip -d data/finetune

Execute Fine-tuning

  • After pre-training the Coke model, move pytorch_model.bin to the corresponding checkpoint directory (checkpoint/coke-$BACKBONE):
export BACKBONE=bert-base-uncased
export HOP=2

mv outputs/pretrain_coke-$BACKBONE-$HOP/pytorch_model.bin ../checkpoint/coke-$BACKBONE/pytorch_model.bin

FewRel/Figer/Open Entity/TACRED

bash run_finetune.sh

Citation

Please cite our paper if you use CokeBert in your work:

@article{SU2021,
title = {CokeBERT: Contextual Knowledge Selection and Embedding towards Enhanced Pre-Trained Language Models},
author = {Yusheng Su and Xu Han and Zhengyan Zhang and Yankai Lin and Peng Li and Zhiyuan Liu and Jie Zhou and Maosong Sun},
journal = {AI Open},
year = {2021},
issn = {2666-6510},
doi = {10.1016/j.aiopen.2021.06.004},
url = {https://arxiv.org/abs/2009.13964},
}

Contact

Yusheng Su

Mail: [email protected]; [email protected]

