haozheji / multigen
EMNLP2020 - Language Generation with Multi-hop Reasoning on Commonsense Knowledge Graph
Hello, I have a question about the details of computing scores for unvisited nodes described in Section 3.2.3 of the paper. 1. At each time step of generating the output sequence, is the node-score computation restarted from scratch (initial nodes set to 1 and all other nodes to 0), or is it updated on top of the scores from the previous time step? 2. Section 4.2 says "Specifically, we iterate the following process for H hops: starting from the nodes in the current sub-graph (initialized by Cx) and search for the direct neighbours of each node and preserve top-B nodes with the connected edges to enlarge the sub-graph", which indicates that each hop visits the top-B nodes. However, Section 3.2.2 says "Specifically, the module broadcasts information on G by updating the score of outer nodes with their visited neighbours for multiple hops until all the nodes on G are visited.", which says every node on G gets visited. If each hop only visits the top-B nodes, won't some nodes remain unvisited after H hops?
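For what it's worth, the top-B sub-graph expansion described in Section 4.2 can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name, the adjacency representation, and the ranking heuristic (rank candidates by how many visited nodes reach them) are all assumptions.

```python
def expand_subgraph(adj, seeds, hops=2, top_b=2):
    """Grow a sub-graph for `hops` iterations with top-B pruning.

    adj:   dict mapping each node to a list of its direct neighbours.
    seeds: the initial concept nodes (Cx in the paper's notation).
    Each hop collects the unvisited direct neighbours of the current
    sub-graph and keeps only top_b of them (here ranked by how many
    already-visited nodes point to them -- an illustrative heuristic).
    """
    visited = set(seeds)
    for _ in range(hops):
        # count, for every unvisited neighbour, how many visited nodes reach it
        counts = {}
        for u in visited:
            for v in adj.get(u, ()):
                if v not in visited:
                    counts[v] = counts.get(v, 0) + 1
        if not counts:
            break
        # keep only the top-B candidate neighbours this hop
        kept = sorted(counts, key=lambda v: (-counts[v], v))[:top_b]
        visited.update(kept)
    return visited

# Toy graph: with top_b=2 and hops=2, node 5 is never reached,
# so pruning can indeed leave some nodes of G unvisited.
adj = {0: [1, 2, 3], 1: [4], 2: [4], 3: [5]}
print(expand_subgraph(adj, [0], hops=2, top_b=2))  # {0, 1, 2, 3, 4}
```

Note that in the toy run above node 5 stays outside the sub-graph, which is exactly the situation the question asks about: with a finite B, "all nodes visited" can only hold relative to the pruned sub-graph, not the full graph.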
Dear authors,
Why do you set the size of the relational embedding to 40 (line 733 in modeling_gpt2.py)? This embedding layer seems to encode a total of 'max_triple_size' (800 by default) relations, so what does the 40 mean?
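One plausible reading, stated as an assumption rather than a confirmed answer: 40 would be the number of distinct relation *types* (the relation vocabulary), while max_triple_size only bounds how many triples a single instance carries; each of the 800 triples then looks its type id up in the same 40-row table. A minimal numpy sketch of that distinction (all names and sizes here are illustrative):

```python
import numpy as np

NUM_RELATION_TYPES = 40   # hypothetical: vocabulary of relation labels
MAX_TRIPLE_SIZE = 800     # triples per instance; each carries a type id
EMBED_DIM = 100           # illustrative embedding width

rng = np.random.default_rng(0)
# the "embedding layer": one row per relation type, not per triple
rel_table = rng.normal(size=(NUM_RELATION_TYPES, EMBED_DIM))

# 800 triples, each tagged with a relation-type id in [0, 40)
triple_rel_ids = rng.integers(0, NUM_RELATION_TYPES, size=MAX_TRIPLE_SIZE)
rel_vectors = rel_table[triple_rel_ids]   # shape: (800, 100)
```

Under this reading, 800 indexes into the table rather than sizing it, so the two numbers answer different questions.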
Hi,
Haozhe,
I found that the data lack the ../data/merge_relation.txt, so can you upload the related data? Thanks!
Hello,
When I run the above code from main.py, I get the following error:
2020-11-22 10:20:41 | INFO | transformers.tokenization_utils_base | Model name '../models/gpt2-small' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, distilgpt2). Assuming '../models/gpt2-small' is a path, a model identifier, or url to a directory containing tokenizer files.
2020-11-22 10:20:41 | INFO | transformers.tokenization_utils_base | Didn't find file ../models/gpt2-small/added_tokens.json. We won't load it.
2020-11-22 10:20:41 | INFO | transformers.tokenization_utils_base | Didn't find file ../models/gpt2-small/special_tokens_map.json. We won't load it.
2020-11-22 10:20:41 | INFO | transformers.tokenization_utils_base | Didn't find file ../models/gpt2-small/tokenizer_config.json. We won't load it.
[11/22/2020 10:20:41 - INFO - transformers.tokenization_utils_base] Didn't find file ../models/gpt2-small/tokenizer.json. We won't load it.
OSError: Can't load config for '../models/gpt2-small'. Make sure that:
Where can I download the gpt2-small files? I have already placed the file mentioned in the README at 'gpt2-small/pytorch_model.bin', but it cannot be loaded.
Hi, haozhe
I found that some data files are missing, including concepts_nv.json and vocab.json. Some other data may be missing as well.
Thanks!
In the BERT source code, create_attention_mask_from_input_mask contains the comment: "We don't assume that from_tensor is a mask (although it could be). We don't actually care if we attend *from* padding tokens (only *to* padding) tokens so we create a tensor of all ones." This means that query positions which are padding also receive meaningless attention scores. Are those handled somewhere later on? This has puzzled me for a long time, thanks!
Thank you!
Recently I read your EMNLP paper Language Generation with Multi-Hop Reasoning on Commonsense Knowledge Graph, which gave me a lot of inspiration for story generation with knowledge. However, I'm confused about the experimental results on Story Ending Generation. In your paper, the baseline model IE+GA reaches 20.8 / 6.4 BLEU-1/2, while it reaches 0.2566 / 0.0854 BLEU-1/2 in the original IE+MSA(GA) paper you cited, under the same setting.
Experiment results in Language Generation with Multi-Hop Reasoning on Commonsense Knowledge Graph
Experiment results in Story Ending Generation with Incremental Encoding and Commonsense Knowledge
Are there any detailed differences between the experiment settings?
Or do I misunderstand some important things?
I hope you can clear up my doubts.
Hi,
I want to evaluate the model trained on the anlg
data on a separate dataset.
What would be the correct way to proceed with this?
Thank you very much,
Karin
How to deal with vocab.json?
I used add_special_tokens.py, but it still doesn't run!
Why not input only the source sequence?
Thank you~
When I run the inference code:
python3 main.py \
    --train_data_file ${ROOT_PATH}/data/${DATA_TYPE}/train \
    --dev_data_file ${ROOT_PATH}/data/${DATA_TYPE}/dev \
    --test_data_file ${ROOT_PATH}/data/${DATA_TYPE}/test \
    --graph_path 2hops_100_directed_triple_filter.json \
    --output_dir ${ROOT_PATH}/models/${DATA_TYPE}/grf-${DATA_TYPE} \
    --source_length 32 \
    --target_length 32 \
    --model_type gpt2 \
    --model_name_or_path ${ROOT_PATH}/models/gpt2-small \
    --do_eval \
    --per_gpu_train_batch_size 16 \
    --per_gpu_eval_batch_size 16 \
    --workers 7 \
    --seed 42 \
    --evaluate_metrics bleu \
    --overwrite_output_dir \
    --aggregate_method max \
    --gamma 0.5
Error
[04/09/2022 21:49:57 - WARNING - root] Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
Setting 'max_len_single_sentence' is now deprecated. This value is automatically set up.
Setting 'max_len_sentences_pair' is now deprecated. This value is automatically set up.
source length: 32
[04/09/2022 21:49:58 - INFO - modeling_gpt2] Tie weights in head!!!!!
[04/09/2022 21:49:58 - INFO - modeling_gpt2] Tie weights in head!!!!!
Traceback (most recent call last):
File "main.py", line 620, in
main()
File "main.py", line 588, in main
model.resize_token_embeddings(len(tokenizer))
File "/home/hx/anaconda3/envs/torch1.4/lib/python3.7/site-packages/transformers/modeling_utils.py", line 724, in resize_token_embeddings
model_embeds = self._resize_token_embeddings(new_num_tokens)
File "/home/hx/anaconda3/envs/torch1.4/lib/python3.7/site-packages/transformers/modeling_utils.py", line 738, in _resize_token_embeddings
old_embeddings = self.get_input_embeddings()
File "/home/hx/anaconda3/envs/torch1.4/lib/python3.7/site-packages/transformers/modeling_utils.py", line 563, in get_input_embeddings
return base_model.get_input_embeddings()
File "/home/hx/anaconda3/envs/torch1.4/lib/python3.7/site-packages/transformers/modeling_utils.py", line 565, in get_input_embeddings
raise NotImplementedError
NotImplementedError
Hi,
Where is the "data/conceptnet_en.txt" file in your data.zip?
Thanks
Hello,
At test time, since the target is not visible, how should data such as "concept_label" and "gate_labels" be handled?
Thanks!
Dear author,
I have tried applying the GRF model to a Reddit dataset. However, I found it performs poorly on it (similar BLEU, but low quality according to human judges). Have you tried applying GRF to Reddit or other conversation datasets? Is the GRF model suitable for conversational data? Are there any tricks for making this work?
Looking forward to your reply, thank you!
result_eptest.txt:
A plane is too large to fit in a grocery store, and a plane is too large to fit in a plane. It is too large to fit in a
the time of the time of the time of the time of the time of the time of the time. time. and the time of the time of
A dragon is too big to be ridden on a dragon's body. and a dragon is too big to be ridden on a dragon's body. and
What caused this? Thank you~
I noticed that the test dataset also requires 2hops_100_directed_triple_filter.json files generated from the source and target files. So when I want to test an arbitrary sentence with the trained model, there is no ready-made target to use. How should I generate a result for an arbitrary sentence?
I've tried several English sentences and found that the two tokenizers gave the same results.
I have just performed a test on your main.py, using GPT2's default tokenizer and your tokenizer collectively:
if __name__ == '__main__':
    # main()
    tokenizer = GPT2Tokenizer.from_pretrained("../models/gpt2-small/", do_lower_case=False)
    print(tokenizer.encode("I don't want to live in America."))
    print(tokenizer.decode(tokenizer.encode("I don't want to live in America.")))
    print(tokenizer.encode("i don't want to live in america."))
    print(tokenizer.decode(tokenizer.encode("i don't want to live in america.")))
    print(tokenizer.encode("i don't want to live in america ."))
    print(tokenizer.decode(tokenizer.encode("i don't want to live in america .")))
    print(tokenizer.encode("i do n't want to live in america ."))
    print(tokenizer.decode(tokenizer.encode("i do n't want to live in america .")))
The two tokenizers return the same results. So what's the difference between your tokenizer and the default GPT-2 tokenizer?
Hi, I cannot find the file vocab.json under data.zip.