weizhepei / casrel
A Novel Cascade Binary Tagging Framework for Relational Triple Extraction. Accepted by ACL 2020.
Home Page: https://arxiv.org/abs/1909.03227
License: MIT License
I'd like to ask the author: why does the program run, but only on the CPU?
I get the following error while running the run.py
file
!python run.py --train=True --dataset=NYT
Using TensorFlow backend.
2020-07-03 14:29:11.236282: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-03 14:29:11.241179: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2200000000 Hz
2020-07-03 14:29:11.241354: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1e66a00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-03 14:29:11.241383: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-07-03 14:29:11.243340: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-03 14:29:11.248062: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-07-03 14:29:11.248099: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: e54567bf18c0
2020-07-03 14:29:11.248114: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: e54567bf18c0
2020-07-03 14:29:11.248169: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.67.0
2020-07-03 14:29:11.248200: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.67.0
2020-07-03 14:29:11.248212: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 418.67.0
train_data len: 56195
dev_data len: 4999
test_data len: 1297
Traceback (most recent call last):
  File "run.py", line 40, in <module>
    subject_model, object_model, hbt_model = E2EModel(bert_config_path, bert_checkpoint_path, LR, num_rels)
  File "/content/CasRel/model.py", line 15, in E2EModel
    bert_model = load_trained_model_from_checkpoint(bert_config_path, bert_checkpoint_path, seq_len=None)
  File "/usr/local/lib/python3.6/dist-packages/keras_bert/loader.py", line 170, in load_trained_model_from_checkpoint
    load_model_weights_from_checkpoint(model, config, checkpoint_file, training=training)
  File "/usr/local/lib/python3.6/dist-packages/keras_bert/loader.py", line 114, in load_model_weights_from_checkpoint
    loader('bert/encoder/layer_%d/output/dense/kernel' % i),
  File "/usr/local/lib/python3.6/dist-packages/keras_bert/loader.py", line 18, in _loader
    return tf.train.load_variable(checkpoint_file, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 85, in load_variable
    return reader.get_tensor(name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/py_checkpoint_reader.py", line 70, in get_tensor
    self, compat.as_bytes(tensor_str))
IndexError: Read less bytes than requested
Please do let me know if there is any solution to this problem. Thanks in advance! :D
The code uses the tokenizer from keras-bert, but its tokenization behavior seems unusual. For example:
"text": "三国中的谋士很多,但是谋士也要分不同的类别,有的善于统筹全局,有的善于战术规划,有的善于外交连横,不过说实话,其中大部分知名谋士的结局都不太好,如:荀彧被曹操逼死,陆逊被孙权气死,就连大家最敬仰的诸葛亮也是被军国大事给累死,但是有一个谋士不但得到了善终,而且还位高权重,关键就在于他在生涯中的五次站队都成功了,我们来看看吧",
"triple_list": [
    [
        "陆逊",
        "朝代",
        "三国"
    ]
]
}
For this example, the text gets tokenized into ['[CLS]', '##三', '##国', ..., '[unused1]', '[SEP]'], and the corresponding subject gets tokenized into ['##陆', '##逊', '[unused1]']. Is there some reason for this, or is it just a bug? If it works this way, the subject's and object's positions can't be found in the original input sequence, so the corresponding labels can't be produced. (bert-base-chinese, with BERT's own vocab.txt)
But the code in the data_generator stage seems to deliberately work around the trailing [unused1] token?
for triple in line['triple_list']:
    # Is the [1:-1] below, which drops the end tokens, some special tokenization mechanism or just a bug???
    triple = (self.tokenizer.tokenize(triple[0])[1:-1], triple[1], self.tokenizer.tokenize(triple[2])[1:-1])
    sub_head_idx = find_head_idx(tokens, triple[0])
    obj_head_idx = find_head_idx(tokens, triple[2])
    if sub_head_idx != -1 and obj_head_idx != -1:
        sub = (sub_head_idx, sub_head_idx + len(triple[0]) - 1)
        if sub not in s2ro_map:
            s2ro_map[sub] = []
        s2ro_map[sub].append((obj_head_idx,
                              obj_head_idx + len(triple[2]) - 1,
                              self.rel2id[triple[1]]))
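As a side note, here is a minimal sketch of what the `[1:-1]` slice actually removes (a hand-written stand-in token list, not real keras-bert output):

```python
# Stand-in for a BERT-style tokenizer's output (hypothetical tokens, not
# produced by keras-bert): tokenize() wraps the pieces in [CLS] ... [SEP].
tokens = ['[CLS]', 'suf', '##folk', '[unused1]', '[SEP]']

# [1:-1] drops exactly the first and last elements, i.e. [CLS] and [SEP];
# a trailing boundary marker such as [unused1] survives the slice.
inner = tokens[1:-1]
print(inner)  # ['suf', '##folk', '[unused1]']
```

Whether the trailing marker should also be stripped is exactly the question raised above.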
Also, keras-bert 0.80.0 seems unusable.
My environment is as follows:
keras == 2.4.3
keras-bert == 0.81.1
tensorflow-gpu == 1.13.1
To sum up, my questions are as follows:
I notice that in the data_generator class you preserve only one relation triple by using random.choice(). What is the motivation for this?
Thanks
Feng
Since only the hbt model is saved after training, how can we load the subject and object models at test time?
hello~
I've read through your paper and code. The one part I can't figure out is the inference stage.
For subjects, I see that
np.where(sub_heads_logits[0] > h_bar)[0], np.where(sub_tails_logits[0] > t_bar)[0]
extracts all the heads and tails. Say
heads = [0, 1]
tails = [4, 5]
The final combinations would be
[0,4], [0,5], [1,4], [1,5]
How are the wrong subjects ruled out?
Also, when predicting objects,
sub_heads, sub_tails = np.array([sub[1:] for sub in subjects]).T.reshape((2, -1, 1))
feeds multiple head and tail positions into object_net, so seq_gather should run into trouble, right? And following your subject loop,
obj_heads, obj_tails = np.where(obj_heads_logits[i] > h_bar), np.where(obj_tails_logits[i] > t_bar)
uses i to pick out the head/tail results for the current subject, but obj_tails_logits doesn't seem to have that dimension?
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
this should be changed to
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"
or
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
when the error happens, or set it to match your own devices
Hi,
While setting up, I'm facing the issue below when running !python run.py --train=True --dataset=NYT
It looks like a versioning issue. Could you please share a requirements.txt, or the packages used along with their specific versions?
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/keras/__init__.py", line 3, in <module>
    from tensorflow.keras.layers.experimental.preprocessing import RandomRotation
ModuleNotFoundError: No module named 'tensorflow.keras'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "run.py", line 3, in <module>
    from model import E2EModel, Evaluate
  File "/content/CasRel/model.py", line 2, in <module>
    from keras.layers import *
  File "/usr/local/lib/python3.6/dist-packages/keras/__init__.py", line 6, in <module>
    'Keras requires TensorFlow 2.2 or higher. '
ImportError: Keras requires TensorFlow 2.2 or higher. Install TensorFlow via `pip install tensorflow`
model.py
line 52-65
gold_sub_heads = K.expand_dims(gold_sub_heads, 2)
gold_sub_tails = K.expand_dims(gold_sub_tails, 2)
sub_heads_loss = K.binary_crossentropy(gold_sub_heads, pred_sub_heads)
sub_heads_loss = K.sum(sub_heads_loss * mask) / K.sum(mask)
sub_tails_loss = K.binary_crossentropy(gold_sub_tails, pred_sub_tails)
sub_tails_loss = K.sum(sub_tails_loss * mask) / K.sum(mask)
obj_heads_loss = K.sum(K.binary_crossentropy(gold_obj_heads, pred_obj_heads), 2, keepdims=True)
obj_heads_loss = K.sum(obj_heads_loss * mask) / K.sum(mask)
obj_tails_loss = K.sum(K.binary_crossentropy(gold_obj_tails, pred_obj_tails), 2, keepdims=True)
obj_tails_loss = K.sum(obj_tails_loss * mask) / K.sum(mask)
loss = (sub_heads_loss + sub_tails_loss) + (obj_heads_loss + obj_tails_loss)
I only see the losses for the head and tail entities. Is the relation loss computed together with the tail entity, or how is it computed? I also don't see any relation information in the inputs, other than the number of relations.
lines 43-44
pred_obj_heads = Dense(num_rels, activation='sigmoid')(tokens_feature)
pred_obj_tails = Dense(num_rels, activation='sigmoid')(tokens_feature)
I don't quite understand this part. Could you explain? Thank you.
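For what it's worth, a toy shape-level sketch (numpy, with made-up values; not the repo's training code) of how the relation ends up inside the object loss: each relation gets its own sigmoid tagger column, so a wrong relation is penalized by the same binary cross-entropy as a wrong token position:

```python
import numpy as np

# pred_obj_heads has one sigmoid score per token per relation:
# shape (seq_len, num_rels). Toy sizes and random scores here.
seq_len, num_rels = 4, 3
rng = np.random.default_rng(0)
pred = rng.uniform(0.01, 0.99, size=(seq_len, num_rels))
gold = np.zeros((seq_len, num_rels))
gold[1, 2] = 1.0  # token 1 is an object head under relation 2

# elementwise binary cross-entropy over every (token, relation) cell
bce = -(gold * np.log(pred) + (1 - gold) * np.log(1 - pred))

# Summing over the relation axis mirrors K.sum(..., 2, keepdims=True)
# in model.py: a score in the wrong relation column contributes loss
# just like a score at the wrong token, so no separate relation
# classifier or relation loss term is needed.
per_token_loss = bce.sum(axis=1, keepdims=True)
print(per_token_loss.shape)  # (4, 1)
```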
I reproduced a script to preprocess the NYT dataset for joint entity and relation extraction. I aligned the dataset (CopyRE) to the original NYT distant-supervision dataset. No third-party tool is needed because all the sentences can be found in the original NYT dataset.
Here is my script.
All my versions match, but training fails with an error. Please help, thanks.
Can you give me some suggestions and some tips?
Thanks!
Looking forward to your reply
The subject_model and object_model predict the position of an entity's tail, i.e., the index of the entity's last token. But in extract_items, after the model predicts the positions, the entities are selected with subject = tokens[sub_head: sub_tail] and obj = tokens[obj_head: obj_tail]. Doesn't this leave the entity's last token out of the result? Shouldn't it be subject = tokens[sub_head: sub_tail+1] and obj = tokens[obj_head: obj_tail+1]?
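A minimal illustration of the slicing question (toy tokens, not the repo's actual extract_items): with a tail index that is inclusive, a plain slice does drop the last token.

```python
# Toy example: if sub_tail is the index of the entity's last token
# (a closed interval, as in data_loader's sub_head_idx + len(...) - 1),
# then tokens[sub_head: sub_tail] excludes that last token, because
# Python slices are half-open.
tokens = ['the', 'new', 'york', 'times', 'reported']
sub_head, sub_tail = 1, 3  # entity 'new york times', tail index inclusive

print(tokens[sub_head: sub_tail])      # ['new', 'york'] -- last token lost
print(tokens[sub_head: sub_tail + 1])  # ['new', 'york', 'times']
```

Whether the repo's indices are actually inclusive at that point is the open question above.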
Hello:
I selected ExactMatch when testing, but the resulting entities still contain only head words. I looked at the training and test data, and it seems all entities are just head words. If I want to evaluate ExactMatch, do I need to reprocess the data myself into training data with entity boundaries? Thanks
Hi, for the NYT and WebNLG results in Table 2, are the subj/obj spans single tokens? Or did you only take the head word of each subj/obj during preprocessing?
The paper reports 89.4, 92.2, and 94.7 on WebNLG-Normal, WebNLG-EPO, and WebNLG-SEO. But in my reproduction, once the F1 score on WebNLG reaches the 91.8 reported in the paper, it is hard to exceed 92.5 on WebNLG-SEO yet easy to reach 94.0 on WebNLG-EPO. It seems the original paper may have swapped the WebNLG-EPO and WebNLG-SEO scores. I ran some statistics: WebNLG-Normal, WebNLG-EPO, and WebNLG-SEO have 246, 26, and 457 samples and 246, 98, and 1345 relation triples, respectively. WebNLG-SEO is clearly the main part, with 13 times as many triples as WebNLG-EPO and 5 times as many as WebNLG-Normal. Given that WebNLG-Normal and WebNLG-EPO contain so few triples, it seems unlikely they could drag the score down from 94.7 to 91.8. So it seems more plausible that CasRel achieves 91.8 on the entire WebNLG but 94.7 on WebNLG-SEO.
Please check it again. Thanks~
I'd like to ask: during training, is the input to the Relation-specific Object Taggers the subject predicted by the Subject Tagger, or the gold subject boundaries? If it's the predicted result, how is the loss constructed when the Subject Tagger predicts incorrectly?
I followed the readme steps exactly.
My reproduction on the NYT dataset gives 0.8030 (precision), 0.7882 (recall), and 0.7955 (F1),
far from the paper's 89.7 (precision), 89.5 (recall), and 89.6 (F1), and I can't tell where the problem might be.
It says "list indices must be integers or slices, not str".
The line is:
if not a['relationMentions']
and I didn't see relationMentions being used as an int variable.
I look forward to your reply.
When doing relation extraction, did the author consider that the same entity can appear multiple times in a sentence? For example, the first mention participates in a relation triple while the second one does not.
Hello, @weizhepei , in your paper you mentioned that you have implemented CopyRE(with Reinforcement Learning)? Could you make it public? Thanks a lot.
How does the model solve the following problem? I'd really appreciate an answer; I couldn't figure it out from the code:
Compared with HBT, where was the CasRel model changed so that the F1 improved further?
Traceback (most recent call last):
  File "build_data.py", line 18, in <module>
    if not a['relationMentions']:
TypeError: list indices must be integers or slices, not str
How to solve this problem?
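One hypothetical reconstruction of this failure (toy JSON, not the repo's actual build_data.py): if a parsed line is a JSON array rather than an object, indexing it with a string key raises exactly this TypeError.

```python
import json

# Hypothetical input: the line parses to a JSON array, so `a` is a list.
line = '[{"relationMentions": []}]'
a = json.loads(line)

try:
    a['relationMentions']        # string index on a list
except TypeError as e:
    print(e)                     # list indices must be integers or slices, not str

# Guarding on the parsed type avoids the crash:
records = a if isinstance(a, list) else [a]
for rec in records:
    if not rec.get('relationMentions'):
        continue
```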
{
    "text": "Alan Bean ( of the United States ) was a crew member of NASA 's Apollo 12 under the commander David Scott .",
    "triple_list": [
        [
            "Bean",
            "was a crew member of",
            "12"
        ],
        [
            "Bean",
            "nationality",
            "States"
        ],
        [
            "12",
            "operator",
            "NASA"
        ],
        [
            "12",
            "commander",
            "Scott"
        ]
Take this example: shouldn't the entities be Alan Bean and Apollo 12? Results measured on data like this aren't very convincing, are they?
Hello, and many thanks for sharing the code!
Since my skills are limited and I don't know TensorFlow and Keras well, I've recently been implementing a PyTorch version on top of your code, but the reproduced performance is much worse.
Do you have any plans to release a PyTorch version?
Could you share the best model parameters?
The Wiki-KBP training set in the paper has over 50k sentences, but the one I downloaded from the link has only about 20k. Could you check whether the data is correct?
Why are the head and tail entities in the triples all single words?
We can only get a training dataset with 23k sentences from your link, not the one with 79k sentences mentioned in your paper. Is there any problem with the link? Please check again. Or could you please send the 79k-sentence training dataset to our email? ([email protected]) Just for a fair comparison. Thank you!
Hi, doesn't the i on this line conflict with the i of the outer loop? Running the program directly raises a "list index out of range" error. Please take a look at the code:
CasRel/data/NYT/raw_NYT/generate.py
Line 53 in e04924c
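A toy illustration of the kind of shadowing described (hypothetical code, not the actual generate.py): reusing i in a nested loop clobbers the outer index, so a later outer-indexed access can run past the end of the list.

```python
# Hypothetical shadowing bug: the inner loop reuses `i`.
outer = [10, 20]
inner = [0, 1, 2, 3, 4]

def buggy():
    for i in range(len(outer)):
        for i in inner:       # shadows the outer i; i ends up as 4
            pass
        return outer[i]       # IndexError: list index out of range

try:
    buggy()
except IndexError as e:
    print(e)
```

If this matches what happens at line 53, renaming the inner loop variable would be the fix.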
Hello, I want to know the hyperparameters of the LSTM-based model. I noticed that none are mentioned in your paper, and there is no code for it in this GitHub project.
Could you share them? Please.
After many training epochs, the model consumes a large amount of memory and eventually overflows. I suspect the evaluate step; has anyone run into this?
class HBTokenizer(Tokenizer):
    def _tokenize(self, text):
        if not self._cased:
            # strip accents and lowercase for the uncased vocab
            text = unicodedata.normalize('NFD', text)
            text = ''.join([ch for ch in text if unicodedata.category(ch) != 'Mn'])
            text = text.lower()
        spaced = ''
        for ch in text:
            # drop NUL, replacement, and control characters
            if ord(ch) == 0 or ord(ch) == 0xfffd or self._is_control(ch):
                continue
            else:
                spaced += ch
        tokens = []
        for word in spaced.strip().split():
            tokens += self._word_piece_tokenize(word)
            tokens.append('[unused1]')  # marker appended after every word
        return tokens
May I ask why a '[unused1]' token is appended after each word?
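A simplified mimic of the marker logic (with a stand-in word-piece splitter, not keras-bert) suggests one plausible purpose: the '[unused1]' appended after each whitespace-separated word keeps word boundaries recoverable even after words are split into multiple sub-word pieces.

```python
def word_pieces(word):
    # toy stand-in splitter: break a word into 2-character pieces
    return [word[i:i + 2] for i in range(0, len(word), 2)]

def tokenize(text):
    tokens = []
    for word in text.strip().split():
        tokens += word_pieces(word)
        tokens.append('[unused1]')   # word-boundary marker, as in HBTokenizer
    return tokens

toks = tokenize('suffolk county')
print(toks)
# ['su', 'ff', 'ol', 'k', '[unused1]', 'co', 'un', 'ty', '[unused1]']

# word boundaries can be recovered by counting/splitting on the marker
print(toks.count('[unused1]'))  # 2
```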
Is this a version conflict?
Is there a problem in the part of extract_items that recovers the subject's text span? I can't figure it out; hoping for an answer, thanks~
In the code below, sub_heads and sub_tails are the candidate positions for all possible heads and tails:
sub_heads_logits, sub_tails_logits = subject_model.predict([token_ids, segment_ids])
sub_heads, sub_tails = np.where(sub_heads_logits[0] > h_bar)[0], np.where(sub_tails_logits[0] > t_bar)[0]
subjects = []
for sub_head in sub_heads:
    sub_tail = sub_tails[sub_tails >= sub_head]
    if len(sub_tail) > 0:
        sub_tail = sub_tail[0]
        subject = tokens[sub_head: sub_tail]
        subjects.append((subject, sub_head, sub_tail))
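Tracing that loop with toy indices shows that each head is paired only with the nearest tail at or after it, not with every head/tail combination:

```python
import numpy as np

# Toy candidate positions (hypothetical, mirroring the loop above).
sub_heads = np.array([0, 1])
sub_tails = np.array([4, 5])

pairs = []
for sub_head in sub_heads:
    # keep only tails at or after this head, then take the nearest one
    candidates = sub_tails[sub_tails >= sub_head]
    if len(candidates) > 0:
        pairs.append((int(sub_head), int(candidates[0])))

print(pairs)  # [(0, 4), (1, 4)] -- not all of [0,4],[0,5],[1,4],[1,5]
```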
From the data_loader code, when the gold labels are built, the subject and object text spans are both closed intervals (note sub_head_idx + len(triple[0]) - 1):
for triple in line['triple_list']:
    triple = (self.tokenizer.tokenize(triple[0])[1:-1], triple[1], self.tokenizer.tokenize(triple[2])[1:-1])
    sub_head_idx = find_head_idx(tokens, triple[0])
    obj_head_idx = find_head_idx(tokens, triple[2])
    if sub_head_idx != -1 and obj_head_idx != -1:
        sub = (sub_head_idx, sub_head_idx + len(triple[0]) - 1)
        if sub not in s2ro_map:
            s2ro_map[sub] = []
        s2ro_map[sub].append((obj_head_idx,
                              obj_head_idx + len(triple[2]) - 1,
                              self.rel2id[triple[1]]))
sub_head, sub_tail = choice(list(s2ro_map.keys())) — why is only one of the sentence's subjects randomly sampled? What is the purpose of this? Thanks for explaining.
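A toy sketch of the sampling step being asked about (hypothetical s2ro_map contents); presumably drawing one subject per sentence per pass keeps each training example at a single subject span, with different subjects seen across epochs:

```python
from random import choice

# Hypothetical map from subject spans to (obj_head, obj_tail, rel_id) lists,
# shaped like the s2ro_map built above.
s2ro_map = {
    (3, 4): [(7, 8, 0)],
    (10, 11): [(1, 2, 5)],
}

# One subject span is sampled uniformly for this training example.
sub_head, sub_tail = choice(list(s2ro_map.keys()))
assert (sub_head, sub_tail) in s2ro_map
print(s2ro_map[(sub_head, sub_tail)])
```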
How much time does it take for an epoch? I got a 16 GB GPU.
Hello,
The paper says WebNLG has 246 relations, but I only see 170 in the downloaded dataset. Why is that?
gives:
No module named 'tensorflow.python.framework'
Solution:
tensorflow-gpu 1.15
CasRel/data/NYT/raw_NYT/generate.py
All entities in the files produced by this code have only one word. For example, New York becomes York.
WebNLG has the same problem.
But the sample in the project README looks normal, so it must be a problem with the shared files. Could you please fix the files shared on Google Drive? Thanks!
How long does the training process take?
https://drive.google.com/file/d/10f24s9gM7NdyO3z5OqQxJgYud4NnCJg3/view
train.json contains only a single line of number lists.
Using CopyRE's NYT dataset: when processing entities, it keeps only the last word,
e.g. Suffolk County -> County.
CasRel also uses this dataset, converting the numeric dataset back to text and keeping the relation triples.
Then, per the CasRel paper, wouldn't the subject's start and end both point to County? Isn't that a problem?