
ernie's Introduction

ERNIE (sub-project of OpenSKL)

ERNIE is a sub-project of OpenSKL, providing an open-sourced toolkit (Enhanced language RepresentatioN with Informative Entities) for augmenting pre-trained language models with knowledge graph representations.

Overview

ERNIE contains the source code and dataset for "ERNIE: Enhanced Language Representation with Informative Entities", and is an effective and efficient toolkit for augmenting pre-trained language models with knowledge graph representations.

Models

We provide our knowledge-enhanced pre-trained language model ERNIE in this toolkit. We also provide the detailed commands to fine-tune ERNIE for different downstream tasks.

Evaluation

We validate the effectiveness of ERNIE on entity typing and relation classification tasks through fine-tuning.

Settings

We use the following datasets: FIGER and OpenEntity for entity typing, and FewRel and TACRED for relation classification. We first fine-tune the models (BERT and ERNIE) and then evaluate their accuracy and F1 scores.
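For reference, the snippet below sketches how these two metrics are computed. It is an illustrative example only, not the repository's evaluation code (see code/score.py and the eval_* scripts for the actual implementations).

  # Illustrative metric computation (not the repository's evaluation code).
  def accuracy(gold, pred):
      return sum(g == p for g, p in zip(gold, pred)) / len(gold)

  def micro_f1(gold_sets, pred_sets):
      # gold_sets / pred_sets: one set of labels per example (multi-label typing)
      tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
      fp = sum(len(p - g) for g, p in zip(gold_sets, pred_sets))
      fn = sum(len(g - p) for g, p in zip(gold_sets, pred_sets))
      precision = tp / (tp + fp) if tp + fp else 0.0
      recall = tp / (tp + fn) if tp + fn else 0.0
      return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

  print(accuracy(["r1", "r2"], ["r1", "r3"]))                        # 0.5
  print(micro_f1([{"person", "artist"}, {"location"}],
                 [{"person"}, {"location", "city"}]))                # 0.666...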

Results

Here we report the main results on the above datasets. From this table, we observe that ERNIE effectively improves the performance of BERT on these knowledge-driven tasks.

Model   FIGER (Acc.)   OpenEntity (F1)   FewRel (F1)   TACRED (F1)
BERT    52.04          73.56             84.89         66.00
ERNIE   57.19          75.56             88.32         67.97

Usage

Requirements:

  • PyTorch >= 0.4.1
  • Python3
  • tqdm
  • boto3
  • requests
  • apex (if you want to use fp16, make sure you check out commit 79ad5a88e91434312b43b4a89d66226be5f2cc98)
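Before running anything, a quick sanity check of the environment can save time. The sketch below is illustrative and only checks the dependencies listed above; apex is needed only for --fp16 runs.

  # Quick environment check (illustrative; version pins follow the list above).
  import sys
  import torch

  print("Python:", sys.version.split()[0])        # should be 3.x
  print("PyTorch:", torch.__version__)            # should be >= 0.4.1
  print("CUDA available:", torch.cuda.is_available())
  try:
      import apex  # only needed for --fp16
      print("apex found:", apex.__file__)
  except ImportError:
      print("apex not installed: run without --fp16")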

Prepare Pre-train Data

Run the following command to create training instances.

  # Download Wikidump
  wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
  # Download anchor2id
  wget -c https://cloud.tsinghua.edu.cn/f/6318808dded94818b3a1/?dl=1 -O anchor2id.txt
  # WikiExtractor
  python3 pretrain_data/WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -o pretrain_data/output -l --min_text_length 100 --filter_disambig_pages -it abbr,b,big --processes 4
  # Modify anchors with 4 processes
  python3 pretrain_data/extract.py 4
  # Preprocess with 4 processes
  python3 pretrain_data/create_ids.py 4
  # create instances
  python3 pretrain_data/create_insts.py 4
  # merge
  python3 code/merge.py

If you want to build anchor2id yourself, run the following commands (this takes about half a day) after python3 pretrain_data/extract.py 4:

  # extract anchors
  python3 pretrain_data/utils.py get_anchors
  # query Mediawiki api using anchor link to get wikibase item id. For more details, see https://en.wikipedia.org/w/api.php?action=help.
  python3 pretrain_data/create_anchors.py 256 
  # aggregate anchors 
  python3 pretrain_data/utils.py agg_anchors
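For reference, the snippet below sketches the kind of MediaWiki pageprops query that maps an anchor's article title to its Wikidata item ID. The helper wikibase_item is a hypothetical, illustrative example; see pretrain_data/create_anchors.py for the actual implementation.

  # Illustrative MediaWiki query: article title -> Wikidata item ID.
  # (Hypothetical helper; see pretrain_data/create_anchors.py for the real code.)
  import requests

  def wikibase_item(title):
      params = {
          "action": "query",
          "prop": "pageprops",
          "ppprop": "wikibase_item",
          "redirects": 1,
          "titles": title,
          "format": "json",
      }
      resp = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
      for page in resp["query"]["pages"].values():
          return page.get("pageprops", {}).get("wikibase_item")

  print(wikibase_item("Bob Dylan"))  # e.g. 'Q392'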

Run the following command to pretrain:

  python3 code/run_pretrain.py --do_train --data_dir pretrain_data/merge --bert_model bert_base --output_dir pretrain_out/ --task_name pretrain --fp16 --max_seq_length 256

We pre-train our model on 8 NVIDIA 2080 Ti GPUs with 32 instances per GPU (an effective batch size of 256). Training takes nearly one day (1 epoch is enough).

Pre-trained Model

Download the pre-trained knowledge embeddings from Google Drive/Tsinghua Cloud and extract them.

tar -xvzf kg_embed.tar.gz

Download the pre-trained ERNIE model from Google Drive/Tsinghua Cloud and extract it.

tar -xvzf ernie_base.tar.gz

Note that the extraction may not complete correctly on Windows.

Fine-tuning

As most datasets except FewRel don't have entity annotations, we use TAGME to extract the entity mentions in the sentences and link them to their corresponding entities in KGs. We provide the annotated datasets on Google Drive/Tsinghua Cloud.

tar -xvzf data.tar.gz

In the root directory of the project, run the following commands to fine-tune ERNIE on different datasets.

FewRel:

python3 code/run_fewrel.py   --do_train   --do_lower_case   --data_dir data/fewrel/   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 10   --output_dir output_fewrel   --fp16   --loss_scale 128
# evaluate
python3 code/eval_fewrel.py   --do_eval   --do_lower_case   --data_dir data/fewrel/   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 10   --output_dir output_fewrel   --fp16   --loss_scale 128

TACRED:

python3 code/run_tacred.py   --do_train   --do_lower_case   --data_dir data/tacred   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 4.0   --output_dir output_tacred   --fp16   --loss_scale 128 --threshold 0.4
# evaluate
python3 code/eval_tacred.py   --do_eval   --do_lower_case   --data_dir data/tacred   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 4.0   --output_dir output_tacred   --fp16   --loss_scale 128 --threshold 0.4

FIGER:

python3 code/run_typing.py    --do_train   --do_lower_case   --data_dir data/FIGER   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 2048   --learning_rate 2e-5   --num_train_epochs 3.0   --output_dir output_figer  --gradient_accumulation_steps 32 --threshold 0.3 --fp16 --loss_scale 128 --warmup_proportion 0.2
# evaluate
python3 code/eval_figer.py    --do_eval   --do_lower_case   --data_dir data/FIGER   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 2048   --learning_rate 2e-5   --num_train_epochs 3.0   --output_dir output_figer  --gradient_accumulation_steps 32 --threshold 0.3 --fp16 --loss_scale 128 --warmup_proportion 0.2

OpenEntity:

python3 code/run_typing.py    --do_train   --do_lower_case   --data_dir data/OpenEntity   --ernie_model ernie_base   --max_seq_length 128   --train_batch_size 16   --learning_rate 2e-5   --num_train_epochs 10.0   --output_dir output_open --threshold 0.3 --fp16 --loss_scale 128
# evaluate
python3 code/eval_typing.py   --do_eval   --do_lower_case   --data_dir data/OpenEntity   --ernie_model ernie_base   --max_seq_length 128   --train_batch_size 16   --learning_rate 2e-5   --num_train_epochs 10.0   --output_dir output_open --threshold 0.3 --fp16 --loss_scale 128

Some code is modified from pytorch-pretrained-BERT, where you can find explanations of most parameters.

As the annotations given by TAGME have confidence scores, we use --threshold to set the lowest acceptable confidence score and keep only the annotations whose scores are higher than --threshold. In these experiments, the value is usually 0.3 or 0.4.
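The snippet below illustrates this filtering step, assuming each annotation is stored as [Wikidata ID, start offset, end offset, TAGME score] as in the released data; the fine-tuning scripts perform the equivalent filtering internally.

  # Illustrative filtering of TAGME annotations by confidence score.
  annotations = [["Q8029103", 139, 143, 0.5], ["Q31", 10, 17, 0.12]]
  threshold = 0.3  # value passed via --threshold

  kept = [ann for ann in annotations if ann[3] > threshold]
  print(kept)  # [['Q8029103', 139, 143, 0.5]]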

The script for the evaluation of relation classification only gives the accuracy score. For the macro/micro metrics, you should use code/score.py, which comes from the TACRED repo.

python3 code/score.py gold_file pred_file

You can find gold_file and pred_file for each checkpoint in the output folder (--output_dir).

New Tasks

If you want to use ERNIE in new tasks, you should follow these steps:

  • Use an entity-linking tool like TAGME to extract the entities in the text
  • Look up the Wikidata IDs of the extracted entities
  • Take the text and the entity sequence as input data

Here is a quick-start example (code/example.py) using ERNIE for the masked language model task. We show how to annotate the given sentence with TAGME and build the input data for ERNIE. Note that it will take some time (around 5 minutes) to load the model.

# If you haven't installed tagme
pip install tagme
# Run example
python3 code/example.py
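If you want to see what the annotation step looks like outside of code/example.py, here is a minimal sketch. It assumes the tagme package and a valid GCUBE token, and that you then map the returned Wikipedia titles to Wikidata IDs (for example with a pageprops query like the one sketched in the pre-training section). This is illustrative only; see code/example.py for the working version.

  # Minimal sketch: annotate a sentence with TAGME and collect entity spans.
  # Illustrative only; see code/example.py for the version used in this repo.
  import tagme
  tagme.GCUBE_TOKEN = "YOUR_TOKEN_HERE"  # assumption: your own TAGME (D4Science) token

  text = "Who was Jim Henson? Jim Henson was a puppeteer."
  response = tagme.annotate(text)

  entities = []
  for ann in response.get_annotations(0.3):  # keep annotations with score >= 0.3
      # ann.entity_title is a Wikipedia title; map it to a Wikidata QID
      # before building ERNIE's entity sequence.
      entities.append((ann.entity_title, ann.begin, ann.end, ann.score))

  print(text)
  print(entities)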

Citation

If you use the code, please cite this paper:

@inproceedings{zhang2019ernie,
  title={{ERNIE}: Enhanced Language Representation with Informative Entities},
  author={Zhang, Zhengyan and Han, Xu and Liu, Zhiyuan and Jiang, Xin and Sun, Maosong and Liu, Qun},
  booktitle={Proceedings of ACL 2019},
  year={2019}
}

About OpenSKL

OpenSKL project aims to harness the power of both structured knowledge and natural languages via representation learning. All sub-projects of OpenSKL, under the categories of Algorithm, Resource and Application, are as follows.

  • Algorithm:
    • OpenKE
      • An effective and efficient toolkit for representing structured knowledge in large-scale knowledge graphs as embeddings, with TransR and PTransE as key features to handle complex relations and relational paths.
      • This toolkit also includes three repositories:
    • ERNIE
      • An effective and efficient toolkit for augmenting pre-trained language models with knowledge graph representations.
    • OpenNE
      • An effective and efficient toolkit for representing nodes in large-scale graphs as embeddings, with TADW as a key feature to incorporate text attributes of nodes.
    • OpenNRE
      • An effective and efficient toolkit for implementing neural networks for extracting structured knowledge from text, with ATT as a key feature to consider relation-associated text information.
      • This toolkit also includes two repositories:
  • Resource:
    • The embeddings of large-scale knowledge graphs pre-trained by OpenKE, covering three typical large-scale knowledge graphs: Wikidata, Freebase, and XLORE. The embeddings are free to use under the MIT license; please click the following links to submit download requests.
    • OpenKE-Wikidata
      • Wikidata is a free and collaborative database, collecting structured data to provide support for Wikipedia. The original Wikidata contains 20,982,733 entities, 594 relations and 68,904,773 triplets. In particular, Wikidata-5M is the core subgraph of Wikidata, containing 5,040,986 high-frequency entities from Wikidata with their corresponding 927 relations and 24,267,796 triplets.
      • TransE version: Knowledge embeddings of Wikidata pre-trained by OpenKE.
      • TransR version of Wikidata-5M: Knowledge embeddings of Wikidata-5M pre-trained by OpenKE.
    • OpenKE-Freebase
      • Freebase was a large collaborative knowledge base consisting of data composed mainly by its community members. It was an online collection of structured data harvested from many sources. Freebase contains 86,054,151 entities, 14,824 relations and 338,586,276 triplets.
      • TransE version: Knowledge embeddings of Freebase pre-trained by OpenKE.
    • OpenKE-XLORE
      • XLORE is one of the most popular Chinese knowledge graphs developed by THUKEG. XLORE contains 10,572,209 entities, 138,581 relations and 35,954,249 triplets.
      • TransE version: Knowledge embeddings of XLORE pre-trained by OpenKE.
  • Application:
    • Knowledge-Plugin
      • An effective and efficient toolkit of plug-and-play knowledge injection for pre-trained language models. Knowledge-Plugin is general for all kinds of knowledge graph embeddings mentioned above. In the toolkit, we plug the TransR version of Wikidata-5M into BERT as an example of applications. With the TransR embedding, we enhance the knowledge ability of BERT without fine-tuning the original model, e.g., up to 8% improvement on question answering.

ernie's People

Contributors

albertyang33, h-peng17, riroaki, thucsthanxu13, zzy14


ernie's Issues

Pretrain

How to pretrain the model? I run

python3 code/run_pretrain.py --do_train --data_dir pretrain_data/ --bert_model ernie_base --output_dir pretrain_out/ --task_name pretrain --fp16

but get error message

Traceback (most recent call last):
  File "code/run_pretrain.py", line 423, in <module>
    main()
  File "code/run_pretrain.py", line 320, in main
    loss, original_loss = model(input_ids, segment_ids, input_mask, masked_lm_labels, input_ent.half(), ent_mask, next_sentence_label, ent_candidate.half(), ent_labels)
  File ".local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "ERNIE/code/knowledge_bert/modeling.py", line 839, in forward
    next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))
  File ".local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File ".local/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 942, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File ".local/lib/python3.6/site-packages/torch/nn/functional.py", line 2056, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File ".local/lib/python3.6/site-packages/torch/nn/functional.py", line 1869, in nll_loss
    .format(input.size(0), target.size(0)))
ValueError: Expected input batch_size (32) to match target batch_size (24608).

Where does the value 24608 come from? How can I run pre-training? Thanks.

Using finetuning scripts with BERT

Hi 👋,

Thank you for the great work!

I'm trying to replicate the BERT baseline for the downstream tasks. Is it possible to load a pre-trained BERT model instead of the pre-trained ERNIE model in the fine-tuning code? If not, can you provide some pointers to the code you used for the baseline?

I pointed --ernie_model to the pre-trained BERT model but got this error:

Traceback (most recent call last):
  File "code/run_typing.py", line 573, in <module>
    main()
  File "code/run_typing.py", line 511, in main
    train_examples, label_list, args.max_seq_length, tokenizer_label, tokenizer, args.threshold)
  File "code/run_typing.py", line 168, in convert_examples_to_features
    tokens_a, entities_a = tokenizer_label.tokenize(ex_text_a, [h])
AttributeError: 'NoneType' object has no attribute 'tokenize'

Please let me know if I am missing something here.

Thank you!
June

Could you provide a download mirror in mainland China?

Hi, I see that the pre-trained model and kg_embed are hosted on Google Drive. Since these files are large and access to Google Drive from mainland China is not very smooth, could you provide a domestic download source? Thanks!

fine-tune

Thanks for uploading the code.
I want to ask: when I fine-tune on the FewRel dataset it runs fine, but some of the stored model parameters are NaN. Why is that?

Which entities are sampled for pre-training the TransE KG embeddings?

Thanks for uploading the code.
I have a question about the paper:

we sample part of Wikidata which contains 5,040,986 entities and 24,267,796 fact triples.

Actually, I asked this before, but what I'd like to know is:

  1. How are entities sampled from Wikidata?

  2. How many entities are sampled from Wikidata?

  3. How are Wikidata entities aligned with Wikipedia entities?

If you know about these, or where this sampling happens in the code, I would appreciate it a lot.
Thanks.

Entity question?

Hello,
I am running named entity recognition experiments and would like to use the kg_embed.zip you provide as auxiliary information,
but the entities in kg_embed.zip have no type labels (such as person, location, or organization). How can I classify the entities in kg_embed.zip?

Errors when I run the .py

07/01/2019 14:22:40 - ERROR - knowledge_bert.tokenization - Model name 'ernie_base' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'ernie_base' was a path or url but couldn't find any file associated to this path or url.

07/01/2019 14:22:41 - ERROR - knowledge_bert.modeling - Model name 'ernie_base' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'ernie_base' was a path or url but couldn't find any file associated to this path or url.

Reproducing the TACRED fine-tuning: accuracy differs greatly from the paper

Hello! I tried to reproduce ERNIE's fine-tuning on TACRED. The paper reports P/R/F1 of 69.97/66.08/67.97, but my result is:
Final Score: Precision (micro): 87.648%
Recall (micro): 87.642%
F1 (micro): 87.645%
Because I kept failing to install apex, I did not use it and only changed .half() to .float() in the code; everything else followed the steps in the README.
(1) train: python code/run_tacred.py --do_train --do_lower_case --data_dir data/tacred --ernie_model ernie_base --max_seq_length 256 --train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 4.0 --output_dir output_tacred --loss_scale 128 --threshold 0.4
(2) eval: python code/eval_tacred.py --do_eval --do_lower_case --data_dir data/tacred --ernie_model ernie_base --max_seq_length 256 --train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 4.0 --output_dir output_tacred --loss_scale 128 --threshold 0.4
(3) score: python code/score.py output_tacred/test_gold_model.bin.txt output_tacred/test_pred_model.bin.txt
These are the complete commands I used to get this result.
Could you tell me what might be the cause?

BertEmbedding outputs nan/inf

I tried to get the ERNIE embeddings for my dataset. But when I use BertModel, I found that after one batch the word embeddings in BertEmbedding become full of nan and inf, even though they were initialized normally.
I tried reducing the batch size and the learning rate, but it didn't help. I don't know how to solve this.

Details about training pre-trained embeddings using TransE

Thanks for developing and open-sourcing the code.
I have a question about the original paper.

In the original paper,

To be specific, we sample part of Wikidata which contains 5,040,986 entities and 24,267,796 fact triples.

Is there any plan to describe how the facts are sampled and how the embeddings are trained?
For example, how long it takes, how to tune them, and so on.

Share the triples used to train knowledge embeddings

Hi, thanks for this great work.

I read in a previous issue that you sampled from the original Wikidata knowledge graph and used this subgraph to train TransE. Can you share the subgraph used to train the knowledge embeddings? I want to train the knowledge embeddings myself. Thank you.

Why not add relation classification as a pre-training task?

Thanks for your code. Why not add relation classification during pre-training?
And do you consider multiple relations in one sentence? In most cases, one sentence can contain multiple relationships between different entities, while the example in your paper involves only a single two-entity relationship.

Which version (commit) of apex have you used?

Hi,
I was trying to replicate the experiments but failed because of a version mismatch of apex. I would be grateful if you could let me know the exact commit id.

Missing numbers using wikiextractor

Hey,

Thanks for the nice work. I just wanted to point to an open issue of wikiextractor, in case you are not aware of it: attardi/wikiextractor#189

Some numbers are missing in the output. Here is an example:

Andorra is the <a href="European%20microstates">sixth-smallest nation in Europe</a>, having an area of and a population of approximately .

Instead of:

Andorra is the <a href="European%20microstates">sixth-smallest nation in Europe</a>, having an area of 468 square kilometers (181 sq mi) and a population of approximately 77,006

Are the published results based on a wiki corpus with missing numbers or is it a recent bug?

expected backend CUDA and dtype Float but got backend CUDA and dtype Half

I run the command like this: python code/run_pretrain.py --do_train --data_dir pretrain_data/sample --bert_model ernie_base --output_dir pretrain_out/ --task_name pretrain --max_seq_length 256

However, it seems to have some mistake:
Traceback (most recent call last):
  File "code/run_pretrain.py", line 421, in <module>
    main()
  File "code/run_pretrain.py", line 320, in main
    loss, original_loss = model(input_ids, segment_ids, input_mask, masked_lm_labels, input_ent, ent_mask, next_sentence_label, ent_candidate, ent_labels)
  File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/desmon/gitproject/ERNIE/code/knowledge_bert/modeling.py", line 833, in forward
    output_all_encoded_layers=False)
  File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/desmon/gitproject/ERNIE/code/knowledge_bert/modeling.py", line 765, in forward
    output_all_encoded_layers=output_all_encoded_layers)
  File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/desmon/gitproject/ERNIE/code/knowledge_bert/modeling.py", line 443, in forward
    hidden_states, hidden_states_ent = layer_module(hidden_states, attention_mask, hidden_states_ent, attention_mask_ent, ent_mask)
  File "/home/desmon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/desmon/gitproject/ERNIE/code/knowledge_bert/modeling.py", line 382, in forward
    attention_output_ent = hidden_states_ent * ent_mask
RuntimeError: expected backend CUDA and dtype Float but got backend CUDA and dtype Half
I don't know which part leads to this problem. Any help would be appreciated.

Questions about model inputs and model layers

Thank you for releasing this wonderful work.
What confused me are:

  1. According to your model, if an entity spans more than one token, for example jim henson, the information fusion is only done at the first token, that is jim. Would it be better to do the information fusion at both jim and henson?
  2. I drew three figures of your model layers (sim, mix and norm) [figures omitted]: it seems that the information fusion appears before the multi-head attention over entities (the mix layer), which differs from your aggregator in Figure 2.
  3. I'm a little confused about the alignment in Figure 3. I didn't read that part of the code: what is the value of the placeholder, an all-zeros vector or an embedding vector learned during training?

About BERT

I tried modifying the original BERT SQuAD task for entity typing. For a sentence like "left_context_token + mention_span + right_context_token", I use the whole context and the mention_span as inputs. The input format is '[CLS] context_tokens [SEP] mention_tokens [SEP] [PAD]*'.
The network structure is the same as in the original SQuAD task (BERT pooled_out + FC). I evaluated my BERT model on the OpenEntity dataset and got results different from your paper.
May I know some details about your BERT implementation?

What is the meaning of the entity column?

In train.csv, the typical format of an entity is
[['Q8029103', 139, 143, 0.5], [......]]
Here 'Q8029103' is the identifier of the entity.
What is the meaning of 139, 143 and 0.5?

Seven undefined names

flake8 testing of https://github.com/thunlp/ERNIE on Python 3.7.1

$ flake8 . --count --select=E9,F63,F72,F82 --show-source --statistics

./code/eval_figer.py:559:36: F821 undefined name 'global_step'
                    'global_step': global_step,
                                   ^
./code/eval_figer.py:560:29: F821 undefined name 'tr_loss'
                    'loss': tr_loss/nb_tr_steps,
                            ^
./code/eval_figer.py:560:37: F821 undefined name 'nb_tr_steps'
                    'loss': tr_loss/nb_tr_steps,
                                    ^
./code/run_typing.py:293:16: F821 undefined name 'x'
            if x[i] > 0:
               ^
./code/eval_typing.py:559:36: F821 undefined name 'global_step'
                    'global_step': global_step,
                                   ^
./code/eval_typing.py:560:29: F821 undefined name 'tr_loss'
                    'loss': tr_loss/nb_tr_steps,
                            ^
./code/eval_typing.py:560:37: F821 undefined name 'nb_tr_steps'
                    'loss': tr_loss/nb_tr_steps,
                                    ^
7     F821 undefined name 'global_step'
7

flake8 F821 issues have the potential to halt the runtime with a NameError.

Requirements.txt / Download Connection

Do we need a requirements.txt file or not?

Also, the download connection from California, United States is really poor for the released files. It breaks often and never completes for me.

Thanks,

Pre-train Code

Thanks for your work. I want to use your model to train a Chinese version on my own dataset, but I didn't find your pre-training code. Will you release it?
Thanks.

Pre-training dataset format

Could you please give us an example of the format of the pre-training corpus (such as the .idx and .bin files)? I want to use your model structure with my own corpus to retrain a new model. Thank you very much.

kg_embed.zip

Hello:
I downloaded kg_embed.zip and after extracting it found entity2vec.vec, which contains entity vectors. Where can I find the corresponding entity names? I would like to convert it into a GloVe-like format. Thanks.
Looking forward to your reply. Best.

Can I use entities without KG embeddings?

Hi,

If I have my own KG entities (e.g., about 500 focused on a specific domain), can I just feed the entities to ERNIE without KG embeddings?

Maybe I would make some modifications to the model, aligning the entity sequence to the token sequence and adding one embedding module.

I wonder whether my solution could achieve a result similar to ERNIE's.

Thanks a lot!

A question about the imports in ERNIE's \knowledge_bert\file_utils.py

Hi,

from typing import Optional, Tuple, Union, IO, Callable, Set

Aren't these variables or classes available in typing?

Actually, what I most want to figure out is how the entity tokens are aligned with the sentence tokens.

I would like to try example.py on Windows; I am not sure whether my issue is caused by the operating system.

Questions about OpenEntity

Hi, I found that your OpenEntity dataset only keeps the general labels such as 'person', 'location', and so on. I tried to recover all 10,331 labels, using code/run_typing.py for training and only modifying the variable 'label_list'; it seems that the other running configs don't need to change.
But the result is quite strange: the macro scores only reach (0.0003825039934902759, 0.052011114612163914, 0.0007594229822059352).
Maybe some error in my process causes this huge bias. Have you experimented on the complete ultra-fine dataset?
Thanks a lot~

Questions regarding Tagme

Hi!
I'm wondering how you manage to label a large corpus like Wikipedia with TAGME in a short time. From my experience with the TAGME API, the response time can be pretty slow when labeling a large collection of long articles, and it may take days to fully annotate the Wikipedia corpus. Is there a "local" version of TAGME, or am I simply missing something here? Any help would be appreciated.

Best regards

BERT BASE vs. LARGE

From what I see, you are using BERT BASE for ERNIE and also compare against it in the paper.
What speaks against using BERT LARGE, since it is significantly better? At least it would be good to have some arguments about this here and in the paper.

Preparation of ernie

If I have my own KG and labeled text, what should I do before I use ERNIE to classify my text?
How can I get things such as entity2vec?

About pre-trained ERNIE's model file size

Hi,

I found that the size of pytorch_model.bin inside the pre-trained ERNIE model is only about 220 MB, while the BERT bert-base-uncased model file is about 400 MB.

I know that ERNIE has about 114M parameters in total, while BERT base has 110M, so I infer that ERNIE's model file should not be smaller than BERT's.

Would you please tell me whether I am wrong?

Thanks a lot!

Questions about the number of entities and the pre-training data

Hello:
Thank you for releasing the ERNIE code.
I see that about 5.04 million entities are used for pre-training. Is the pre-training corpus Wikipedia, or some other data? The Wikipedia I have seen doesn't seem to have that many unique words.
